Note IBS 70th Biometric Colloquium
Reference: https://github.com/tidyomics
Michael Love (University of North Carolina-Chapel Hill, USA)
Pragmatic Biometrics for Transcriptomics: Rigor, Reproducibility, and Readability
Major advances in sequencing and other biotechnologies have propelled the state of the art in transcriptomic measurement, to the current state of profiling transcriptomes of individual cells, as well as the ability to directly sequence full RNA transcripts. Throughout changes in technology, accurate biometric analysis requires pragmatic choices in the processing and statistical modeling of transcriptomic measurements, guided by exploratory data analysis. I will discuss lessons learned from the past decade of transcriptomics, from rigorous bias correction, to automated mechanisms of ensuring reproducible analysis, and current efforts at facilitating code readability for data processing and analysis. I will conclude by suggesting how these lessons may be applied to data from new transcriptomic technologies.
Since the publication of the ICH E9(R1) document in 2019, the estimand framework has become a fundamental component of clinical trial protocols. At the same time, complex innovative designs are becoming increasingly popular in drug development. However, it is unclear to what extent the estimand framework is applicable to these novel designs. For example, should each subpopulation (e.g., defined by cancer site) be assigned a different estimand in a basket trial? Or could a single estimand for the general population be used (e.g., defined by positivity for a certain biomarker)? In the case of a platform trial, should different estimands be presented for each drug studied? We discuss estimand considerations relevant to different types of complex innovative designs. We consider trials that allow adding or selecting experimental treatment groups, modifying control groups, and selecting or combining populations. We also address potential data-driven, adaptive selection of estimands in ongoing trials, as well as statistical issues arising with the corresponding estimators, such as borrowing of non-concurrent information.
From Traditional Analysis to the Estimand Framework
Traditional Analysis: Historically, the focus was primarily on the analysis itself, with less emphasis on the precise clinical questions that the analysis aimed to answer.

Estimand Framework: This approach, as recommended by regulatory agencies like the FDA and EMA, focuses on clearly defining the target of estimation (the “estimand”) before deciding on the statistical methods for analysis. This framework encourages researchers to specify:

The treatment effect of interest (e.g., difference in means, hazard ratio)
The population in which this effect will be estimated
The handling of post-randomization events (e.g., treatment discontinuation, use of rescue medication)
The method of handling missing data
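One lightweight way to make these attributes concrete is to record them as structured data at the protocol stage. The sketch below (a Python dataclass; the field names and example values are illustrative, not taken from any specific trial) captures the ICH E9(R1) estimand attributes:

```python
from dataclasses import dataclass

@dataclass
class Estimand:
    """A minimal record of the ICH E9(R1) estimand attributes.

    All field names and example values are illustrative only."""
    population: str            # population in which the effect is estimated
    variable: str              # the endpoint of interest
    treatment: str             # treatment condition being compared
    intercurrent_events: dict  # event -> handling strategy
    summary_measure: str       # population-level summary of the effect

example = Estimand(
    population="symptomatic heart failure patients",
    variable="time to cardiovascular (CV) death",
    treatment="drug vs. placebo",
    intercurrent_events={
        "non-CV death": "competing event (cumulative incidence)",
        "rescue medication": "treatment policy",
    },
    summary_measure="risk difference in CV death by 2 years",
)
```

Writing the estimand down this way forces the post-randomization-event strategies to be stated explicitly before any analysis code is written.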
Key Concepts Discussed
Competing Risks: In survival analysis, a competing risk is an event that precludes the occurrence of the primary event of interest. Adjusting for competing risks is crucial to avoid biased estimates of the incidence of the event of interest.

Censoring for Competing Risks: Traditional survival analysis methods, like Kaplan-Meier or Cox proportional hazards models, might not adequately handle competing risks. Alternative methods, such as the Fine and Gray model, are designed specifically for such scenarios.

Estimand Framework in Practice: By defining an estimand that includes considerations for competing risks, researchers ensure that the analysis directly addresses the clinical question of interest. This involves specifying the treatment effect in the presence of competing events and may involve hypothetical strategies for what would have happened in the absence of these competing events.

Time Frame of Interest: Not specifying the time frame implicitly ties the estimate of the treatment effect to the observed data distribution, which might not generalize well to other contexts.

Terminology and Summary Measures: The discussion highlights issues with unclear terminology and the misuse of summary measures, such as risk and hazard ratios. It emphasizes the need for precise communication in clinical research to avoid misunderstandings about what the data show.

Assumptions and Interpretability: The shift to the estimand framework also involves a critical look at the assumptions underlying statistical models (e.g., proportional hazards) and how these relate to the clinical questions of interest. It’s important to assess these assumptions at the planning stage of a trial to ensure they are appropriate for the estimand.
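The effect of censoring competing events versus accounting for them can be illustrated numerically. The sketch below puts a naive Kaplan-Meier-based risk (competing events treated as censored) next to a minimal Aalen-Johansen cumulative incidence estimator, on invented toy data; it is a didactic sketch, not an implementation of the Fine and Gray model:

```python
def naive_km_risk(data, t):
    """1 - Kaplan-Meier 'survival' for the event of interest, treating
    competing events (status 2) as censored -- the naive approach."""
    event_times = sorted({ti for ti, s in data if s == 1})
    surv = 1.0
    for ti in event_times:
        if ti > t:
            break
        n_risk = sum(1 for tj, _ in data if tj >= ti)
        d = sum(1 for tj, s in data if tj == ti and s == 1)
        surv *= 1 - d / n_risk
    return 1 - surv

def aalen_johansen_cif(data, t):
    """Cumulative incidence of the event of interest, accounting for the
    competing event through the all-cause survival curve."""
    event_times = sorted({ti for ti, s in data if s in (1, 2)})
    surv, cif = 1.0, 0.0
    for ti in event_times:
        if ti > t:
            break
        n_risk = sum(1 for tj, _ in data if tj >= ti)
        d1 = sum(1 for tj, s in data if tj == ti and s == 1)
        d_all = sum(1 for tj, s in data if tj == ti and s in (1, 2))
        cif += surv * d1 / n_risk
        surv *= 1 - d_all / n_risk
    return cif

# toy data: (time, status); status 0 = censored, 1 = event of interest
# (e.g., CV death), 2 = competing event (e.g., non-CV death)
data = [(1, 2), (2, 1), (3, 1), (4, 0), (5, 2), (6, 1)]
naive = naive_km_risk(data, 6)        # 1.00 -- overstates the risk
proper = aalen_johansen_cif(data, 6)  # 7/12, about 0.58
```

On these toy data the naive estimate exceeds the cumulative incidence, illustrating the overestimation described above.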
Solutions to Enhance Precise Treatment Effect Definitions
Clear Definition of Estimands: A crucial step is to adopt and rigorously apply the concept of estimands as outlined by regulatory agencies. This involves:
Clearly defining the target of estimation (estimand) before conducting the analysis.
Specifying the metric of interest (e.g., difference in survival rates, hazard ratio), the population, and how post-randomization events are handled.

Enhanced Communication of Study Objectives: Researchers should ensure that clinical trial objectives and the corresponding statistical analysis plans are communicated clearly, with explicit links between clinical questions and statistical methods.
Adopting Appropriate Statistical Methods: Depending on the trial design and the nature of the data, consider methods that appropriately address the complexities of the data, such as:
Competing risks analysis for survival data when appropriate, to provide a more accurate picture of the risks of different events over time.
Multi-state models for more complex longitudinal data, allowing for the analysis of transitions between different states of health or disease.

Use of Dynamic Prediction Models: These models can offer insights into how risk predictions evolve over time, providing personalized risk assessments for individuals based on their characteristics and treatment received.
Incorporation of Real-world Data: Complementing randomized controlled trial data with real-world evidence can help in understanding the long-term effects and generalizability of treatment effects.
Emphasis on Patient-reported Outcomes: Including and highlighting patient-reported outcomes in the analysis can offer valuable insights into the treatment effects from the patient’s perspective, particularly regarding quality of life and symptom management.
Flexible Analytical Approaches: Adopt flexible analytical strategies that can accommodate varying follow-up times and the dynamic nature of clinical data, such as time-varying effects models.
Training and Education: Encourage ongoing education and training for researchers and clinicians in the latest statistical methods and concepts, ensuring they are equipped to design, analyze, and interpret clinical trials effectively.
Interdisciplinary Collaboration: Foster collaboration between clinicians, statisticians, and data scientists to ensure that clinical questions are appropriately translated into statistical questions—and vice versa.
The example you provided about a heart failure trial highlights several important considerations in defining clinical questions of interest, especially when dealing with time-to-event endpoints in a trial. The discussion underscores the complexity and the need for precision in specifying the estimand in such trials. Let’s explore how to approach the estimand definition in this context, focusing on cardiovascular (CV) death while considering the presence of competing risks, such as non-cardiovascular death.
When defining clinical questions in a trial with time-to-event endpoints, it’s crucial to specify what aspect of the time-to-event data is of interest. This could include:
Each of these summary measures can provide different insights into the treatment’s effect, and the choice among them should be guided by the clinical question of interest, which may vary by stakeholder (patients, prescribers, payers).
In the heart failure trial example, where the primary outcome is a composite of CV death and heart failure hospitalization, with non-cardiovascular death as a key competing event, the approach to defining the estimand should carefully consider the following:
Conclusion
Defining the estimand in a clinical trial, especially one involving time-to-event endpoints and competing risks, requires careful consideration of the outcome of interest, the summary measures, the patient population, and how competing risks and intercurrent events are handled. This precise definition ensures that the trial’s findings are relevant and interpretable for all stakeholders, from patients to healthcare providers and payers.
The explanation provided delves into the complexities of defining and analyzing treatment effects in clinical trials, especially in the context of competing risks and time-to-event data. It underscores the importance of a nuanced approach to estimand definition, reflecting real-life scenarios where patients may experience various outcomes. Here’s a detailed breakdown to enhance understanding:
Clinical trials aim to mirror real-life scenarios where patients might die from disease-specific (CV death) or non-disease-specific causes (non-CV death). This complexity necessitates a comprehensive approach to estimand definition, taking into account the various ways patients’ paths can unfold following treatment intervention.
The presence of competing risks (e.g., CV vs. non-CV death) requires careful consideration in the analysis to accurately estimate the treatment effect on the event of interest. Summary measures should be chosen based on the clinical question and the relevant time frame, which could be:
When defining the estimand in a clinical trial setting, particularly with competing risks and time-to-event endpoints, several key attributes need to be specified:
In the heart failure trial example, the question of interest could be: “Compared to control, how much does the drug decrease the probability of CV death up to two years in heart failure patients who can also die from non-CV causes?”
This approach allows for a detailed analysis that reflects the different possible events a patient might experience, providing insights into the treatment’s effect in a real-world, clinically relevant manner. By distinguishing between different types of events and specifying the summary measure accordingly, researchers can offer more precise and meaningful interpretations of the treatment effects in clinical trials.
Your comprehensive approach to defining clinical questions and specifying estimands in the context of a heart failure trial with time-to-event endpoints is a valuable framework for researchers and clinicians alike. This methodical approach ensures that the clinical trial design is closely aligned with the objectives of the study and addresses the complexities inherent in such trials, particularly those involving competing risks like cardiovascular (CV) death and non-cardiovascular death. Let’s delve deeper into each aspect of this approach to further elucidate its significance and application.
The precision in defining the estimand is crucial for several reasons:
Competing risks present a unique challenge in time-to-event analysis because the occurrence of one event (e.g., non-CV death) precludes the occurrence of the event of interest (e.g., CV death). The traditional Kaplan-Meier method may overestimate the risk of the event of interest in the presence of competing risks. Therefore:
The choice of summary measures should reflect the clinical question of interest and the needs of different stakeholders. For example:
Intercurrent events, such as initiation of additional therapies or loss to follow-up, can significantly affect the interpretation of the treatment effect. Strategies to handle these events include:
Conclusion
The approach to defining clinical questions and estimands in trials with time-to-event endpoints, especially in the context of competing risks, is critical for the success and interpretability of clinical research. By carefully considering the outcome of interest, the summary measures, the patient population, and the handling of competing risks and intercurrent events, researchers can design trials that provide meaningful, actionable insights into the effects of treatments on patient outcomes. This rigorous approach enhances the relevance and applicability of clinical trial findings, ultimately contributing to better informed clinical decisions and improved patient care.
Principal stratification is a causal inference framework used to address post-randomization issues by categorizing individuals based on potential outcomes under different treatments, which are not affected by the actual treatment assignment. It’s particularly useful in trials where outcomes might be influenced by post-treatment variables.
Intercurrent events and competing risks are critical considerations in clinical trials, affecting the interpretation and estimation of treatment effects.
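A toy simulation (all numbers invented) can make the principal stratification idea concrete: the stratum is defined by potential intercurrent-event (ICE) indicators under both arms, so membership does not depend on the actual treatment assignment:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
u = rng.normal(size=n)  # unobserved patient characteristic

# Potential ICE indicators under control and under treatment. These are
# attributes of the patient, not of the realized assignment -- which is
# what makes the stratum a valid basis for causal comparison.
ice_control = u + rng.normal(size=n) > 1.0
ice_treated = u + rng.normal(size=n) > 0.5  # treatment makes the ICE likelier

stratum = ~ice_control & ~ice_treated       # would avoid the ICE under either arm
p_stratum = stratum.mean()

# potential outcomes with a constant individual treatment effect of 1
y_control = u + rng.normal(size=n)
y_treated = y_control + 1.0
effect_in_stratum = (y_treated - y_control)[stratum].mean()
```

Here `p_stratum` is the share of patients in the "never-ICE" principal stratum, and the within-stratum contrast recovers the constructed effect of 1; conditioning instead on the observed ICE status under the assigned arm would mix strata and compare different populations.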
Collignon O, Schiel A, Burman CF, Rufibach K, Posch M, Bretz F. Estimands and Complex Innovative Designs. Clinical Pharmacology & Therapeutics. 2022;112(6):1183-1190.
By Michael Schomaker
Currently, there are limited options for estimating the effect of continuous interventions measured at multiple time points on outcomes (i.e., causal dose-response curves). However, such situations are relevant: in pharmacology, one may be interested in how outcomes (e.g., viral failure) in people living with HIV and on treatment for HIV change over time under different interventions (e.g., different drug concentration trajectories). One challenge with causal inference for sustained interventions is that the positivity assumption is often violated. To address positivity violations, we develop projection functions that reweight and redefine the estimands of interest based on conditional support functions for the respective interventions. With these functions, we can obtain the desired dose-response curve in regions with sufficient support, as well as meaningful estimands that do not require the positivity assumption. We develop a g-computation-type plug-in estimator for this situation. This contrasts with using g-computation estimators in a naive way, i.e., applying them to sustained interventions without addressing positivity violations. These ideas are illustrated with longitudinal data from children living with HIV treated with efavirenz-based regimens. Simulations show in which cases the naive g-computation approach is appropriate, in which cases it leads to bias, and how the proposed weighted estimation approach recovers the alternative estimands of interest.
The discussion centers on a highly specialized area of causal inference within the context of antiretroviral therapy, focusing on the effects of NNRTI drug concentration on patient outcomes, such as viral load suppression in children. This scenario encapsulates several key concepts and challenges in causal inference, especially when dealing with continuous interventions like drug concentrations, longitudinal observational data, and the presence of time-varying confounders.
Scientific Question Translation to Estimand: The core scientific question involves understanding how varying drug concentrations (a continuous intervention) influence the probability of a specific outcome, such as viral load failure. This translates into estimating the causal effect of different hypothetical concentration trajectories on the outcome.
Causal Concentration-Response Relationship: The goal is to elucidate the dose-response curve, which in this context is the relationship between drug concentration and the probability of achieving viral load suppression.
Time-Varying Confounders: Variables such as weight and adherence to medication can affect both the drug concentration (intervention) and the outcome (viral load suppression), and these confounders can change over time. This complexity introduces biases that traditional regression methods cannot adequately address.
Treatment-Confounding Feedback Loop: The dynamic interplay between the intervention (drug concentration) and time-varying confounders exemplifies a situation where the treatment itself can influence the confounders, which in turn affect future treatment levels, creating a feedback loop.
Positivity Violation: In the context of continuous interventions, the assumption of positivity—every participant has a nonzero probability of receiving every possible level of the intervention under study—is often violated. This is because it’s impractical or impossible for some individuals to achieve certain drug concentrations due to physiological or other constraints.
G-Methods: Generalized methods like G-computation and inverse probability of treatment weighting (IPTW) have traditionally been applied to binary or categorical interventions. Extending these methods to continuous interventions presents unique challenges but offers a pathway to estimate causal effects in complex scenarios.
Doubly Robust Estimators: These estimators combine the strengths of outcome modeling and inverse probability weighting to offer more reliable estimates, especially in the presence of model misspecification. Edward Kennedy’s work on doubly robust approaches for continuous interventions represents a significant advancement in this area, providing a method to handle the complexity of such data while mitigating the effects of positivity violations.
While not the focus of today’s discussion, modified treatment policies represent another strategy to address the challenges of causal inference with continuous interventions. By adjusting the estimand or the analysis approach, researchers can reduce reliance on strict assumptions like positivity, offering a “technical trick” to navigate the complexities of these studies.
The case study of NNRTI drug concentration effects in an antiretroviral therapy context underscores the intricate challenges and methodological considerations in causal inference for continuous interventions. Addressing these challenges requires sophisticated statistical techniques that can account for time-varying confounders, treatment-confounder feedback loops, and the inherent difficulties of positivity violations. Advances in methodologies like G-methods and doubly robust estimators provide promising avenues for researchers to obtain more accurate and reliable causal estimates, thereby enhancing our understanding of continuous intervention effects in real-world settings.
The tradeoff in causal inference for continuous interventions, such as drug concentration levels, is clearly outlined. Let’s break it down into more detail.
Objective: The goal is to estimate the causal relationship between the continuous intervention (drug concentration) and the outcome (e.g., viral load suppression) as accurately as possible.
Challenge of Positivity Violations: In continuous interventions, ensuring that every possible level of intervention has a non-zero probability can be difficult. This is known as the positivity assumption. Violations occur when some levels of the intervention are not represented in the population, or are so sparsely represented that reliable estimation becomes problematic.
Objective: To reduce the risk of bias in estimation caused by the lack of data for certain levels of the continuous intervention.
Tradeoff: This approach may involve redefining the estimand or research question, potentially moving away from the original scientific inquiry. This could be achieved by using modified treatment policies or focusing the analysis on more densely populated regions of the intervention space.
Strategy: Finding a balance between the two extremes—sticking closely to the original estimand while managing the risk of positivity violations. This might involve accepting some bias in exchange for maintaining the integrity of the research question or utilizing statistical methods that can handle sparser data.
G-Methods: Including techniques like g-computation and inverse probability of treatment weighting (IPTW), these methods have been used extensively for binary and categorical interventions. Extending them to continuous interventions is more complex and less explored.
Sequential G-Computation: A method used to estimate the expected outcome over time, given a series of interventions. It involves iteratively predicting outcomes based on past observations and specified interventions at each time point.
Modified Treatment Policies: These involve altering the estimand to fit within the regions of the intervention space where positivity holds. It’s a way to circumvent the positivity violations by focusing on more realistic or likely intervention levels.
Doubly Robust Estimators: A contemporary approach that combines outcome modeling with inverse probability weighting to protect against certain types of biases, including those stemming from positivity violations.
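As a concrete sketch of the sequential g-computation idea listed above, the following simulates two time points with treatment-confounder feedback and estimates the always-treat vs. never-treat contrast by iterated outcome regressions. All data and model choices are illustrative (linear working models, which happen to be correctly specified here by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# simulated two-period data with treatment-confounder feedback:
# L0 -> A0 -> L1 -> A1 -> Y (all relationships linear by construction)
L0 = rng.normal(size=n)
A0 = (L0 + rng.normal(size=n) > 0).astype(float)
L1 = 0.5 * L0 + 0.5 * A0 + rng.normal(size=n)
A1 = (L1 + rng.normal(size=n) > 0).astype(float)
Y = A0 + A1 + 0.5 * L0 + 0.5 * L1 + rng.normal(size=n)

def ols_predict(X, y, X_new):
    """Least-squares fit with intercept, then prediction on X_new."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ beta

def seq_gcomp(a0, a1):
    # step 1: regress Y on the full history, predict under A1 = a1
    hist1 = np.column_stack([A0, L0, A1, L1])
    hist1_set = np.column_stack([A0, L0, np.full(n, float(a1)), L1])
    q1 = ols_predict(hist1, Y, hist1_set)
    # step 2: regress that prediction on the earlier history, set A0 = a0
    hist0 = np.column_stack([A0, L0])
    hist0_set = np.column_stack([np.full(n, float(a0)), L0])
    q0 = ols_predict(hist0, q1, hist0_set)
    return q0.mean()

# always-treat vs. never-treat; by construction the true contrast is 2.25
effect = seq_gcomp(1, 1) - seq_gcomp(0, 0)
```

Because the outcome depends on L1, which itself depends on A0, a single regression adjusting for L1 would be biased; the iterated regressions avoid conditioning away part of A0's effect.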
In the example, the simpler linear dose-response scenario showed that ignoring positivity can sometimes work without introducing bias. However, as the scenario becomes more complex (e.g., survival settings with multiple time points), the risk of bias increases, especially for sparse data regions. Large sample sizes or advanced estimation techniques are necessary to accurately estimate the dose-response curve.
The main takeaway is that researchers must carefully consider their objectives and the limitations of their data when choosing a method for estimating causal effects in the presence of continuous interventions. There is no one-size-fits-all solution; the best approach may vary depending on the specific context and the availability of data. Finding a compromise between rigorous estimation and practical considerations is often necessary to advance scientific understanding while acknowledging methodological constraints.
This section outlines the intricacies of estimating causal dose-response curves (CDRCs) for continuous interventions, considering the challenges posed by positivity violations. Here’s a detailed breakdown of the processes and strategies mentioned:
Estimation Challenge: When estimating the CDRC, the task is to map out the expected outcome (e.g., viral load suppression) across a range of continuous intervention values (e.g., drug concentrations). However, this becomes challenging when the data for certain intervention levels (especially lower concentrations) is sparse, leading to potential bias.
Positivity Violations: These occur when the probability of observing a particular level of the intervention, conditional on the covariates, is low or zero, making it difficult to estimate the CDRC accurately for those levels.
Standardization and G-Formula: The G-formula or g-computation involves standardizing with respect to confounders and integrating them out. This process typically assumes that there is sufficient data across the entire range of the intervention.
Weighting Strategy: To address areas with insufficient data, a weighting function is introduced where the weight is set to one if the conditional treatment density is sufficient. Otherwise, the weight is a ratio of the conditional and marginal treatment densities. This approach helps to ensure the estimation remains reliable in areas with adequate data and minimizes bias in sparse data regions.
Extension to Multiple Time Points: The weighting strategy can be extended to cases where there are multiple time points by adjusting the weights according to the presence of conditional support for the treatment densities over time.
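The weighting function described above can be sketched as follows, assuming (purely for illustration) known Gaussian conditional and marginal treatment densities and a chosen support threshold; in practice both densities and the threshold would have to be estimated and justified from the data:

```python
import math
import numpy as np

# Illustrative working model: concentration A given covariate L follows
# N(2L, 1), and marginally A ~ N(0, sqrt(5)). Both densities are assumed
# known here only to keep the sketch short.
def normal_pdf(x, mu, sd):
    z = (np.asarray(x, dtype=float) - mu) / sd
    return np.exp(-0.5 * z ** 2) / (sd * math.sqrt(2 * math.pi))

def weight(a, l, threshold=0.05):
    cond = normal_pdf(a, 2.0 * l, 1.0)          # conditional treatment density
    marg = normal_pdf(a, 0.0, math.sqrt(5.0))   # marginal treatment density
    # weight 1 where conditional support is sufficient; otherwise the
    # conditional-to-marginal density ratio downweights unsupported regions
    return np.where(cond >= threshold, 1.0, cond / marg)

w_supported = float(weight(4.0, 2.0))     # well-supported region: weight 1
w_unsupported = float(weight(-4.0, 2.0))  # essentially unattainable: near 0
```

For a patient with l = 2, a concentration of 4 lies where the conditional density is ample, so the weight is one; a concentration of -4 is essentially unattainable for that patient and is driven toward zero rather than being extrapolated.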
Trade-off: The main trade-off involves choosing between accurately estimating the CDRC and minimizing the risk of bias due to positivity violations. One can either maintain the original estimand and accept some bias or modify the estimand to reduce the bias, potentially at the cost of altering the research question.
Compromise Approach: The compromise involves refining the CDRC estimation to focus on regions with sufficient data support, thereby sticking to the actual restriction as much as possible. This strategy aims to provide accurate estimates where possible while acknowledging and adjusting for areas with less data.
Intervention Strategy: The intervention is defined in terms of the outcomes it produces, focusing on individual concentration trajectories that generate typical outcomes for the specific patient regimen.
Weighted Curve as a Sensitivity Tool: The weighted curve acts as a magnifying glass to assess the CDRC in regions with adequate support. This approach avoids relying on parametric extrapolations in regions with sparse data, providing a more reliable estimate of the causal effects where the data is sufficient.
Balance Between Causation and Association: The method strikes a balance between pure causal inference (which may not be fully possible due to data limitations) and association measures, providing a practical and interpretable estimation of the CDRC that is informed by the data available.
In summary, the process described is a sophisticated method for addressing the complexity of estimating causal effects with continuous interventions in longitudinal data. It incorporates advanced statistical techniques to manage the trade-offs inherent in such analyses, aiming to produce the most accurate and reliable estimates within the constraints of the observed data.
Causal Inference for Continuous Multiple Time Point Interventions https://arxiv.org/abs/2305.06645v2
A diagnostic test provides a statement about an individual’s target condition. This target condition is evaluated based on diverse clinical information such as symptoms, laboratory values, or physical examinations.[1] Diagnostic accuracy studies assess the accuracy of a diagnostic test. In these studies, the diagnostic test is compared to the true state, defined by the reference test. Based on the result of the reference test, patients can be assigned to one of the two target condition states (present or absent).
A diagnostic study is performed to estimate test accuracy in daily practice. If the study design deviates from practical use, the estimated test accuracy may be biased. Therefore, components of the study objective must be defined a priori to avoid discrepancies between the study and daily practice. For example, the study population should be selected based on the target population. Moreover, various interfering events could occur that can lead to either non-existent test results or influenced test decisions. It should be determined how to handle these events.
The trial objective must be formulated during the planning of a diagnostic study.[2] This objective is then translated into the clinical question of interest. For treatment studies, the estimand framework consists of different attributes to define the estimand, which must be aligned with the stated clinical question of interest.[3] We will present an estimand framework for diagnostic studies, including the attributes target population, index test, target condition, accuracy measurement, and strategies for interfering events.
To illustrate this framework, we will present an application example evaluating a computed tomography (CT) scan to detect lung carcinoma. We will define the estimand for this study and discuss several potential interfering events and strategies to handle them, such as premature termination of the CT scan due to coughing.
Study objective: To assess the accuracy (sensitivity and specificity) of computed tomographic coronary angiography (CTCA) to detect coronary artery disease (CAD) in symptomatic patients with clinical indications for coronary imaging who undergo the procedure completely according to the instructions. A nonassessable segment in the CTCA is counted as a positive test result.
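As a sketch of how such a rule feeds into the accuracy measures, the following counts nonassessable results as positive before forming the 2x2 table. The counts are invented toy numbers, not data from the actual study:

```python
def accuracy_from_results(results):
    """Sensitivity and specificity when nonassessable index-test results
    are counted as positive (as in the study objective above).

    `results` holds (index_test, has_disease) pairs, with index_test one of
    'pos', 'neg', 'nonassessable' and has_disease taken from the reference
    standard."""
    def positive(r):
        return r in ("pos", "nonassessable")
    tp = sum(1 for r, d in results if positive(r) and d)
    fn = sum(1 for r, d in results if not positive(r) and d)
    tn = sum(1 for r, d in results if not positive(r) and not d)
    fp = sum(1 for r, d in results if positive(r) and not d)
    return tp / (tp + fn), tn / (tn + fp)

# invented toy counts
toy = ([("pos", True)] * 40 + [("neg", True)] * 5
       + [("nonassessable", True)] * 5
       + [("neg", False)] * 80 + [("pos", False)] * 10
       + [("nonassessable", False)] * 10)
sens, spec = accuracy_from_results(toy)  # 0.90, 0.80
```

On data like these, counting indeterminate results as positive raises sensitivity and lowers specificity relative to excluding them, which is exactly why the handling rule belongs in the a priori study objective.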
It emphasizes the importance of planning and executing diagnostic studies to ensure that the results are representative of real-world clinical situations and can inform appropriate treatment decisions. Here is a detailed breakdown of the points mentioned:
In diagnostic studies, various factors can complicate the interpretation of test results. An “inconclusive test result” is a common issue where it remains uncertain whether a patient has the disease in question. Other potential complications include communication errors before the test procedure that might affect outcomes or interruptions due to various reasons. For instance, “accidental unblinding” occurs when the result of the reference standard is known before conducting the index test, which could introduce bias.
The term “interfering events” refers to these kinds of complications. Although the original study authors may not have used this specific term, it is crucial to recognize and identify potential interfering events within the data. Examples of such events include:
In the specific context of the study under discussion, the authors noted “non-assessable segments,” indicating uncertainty about the disease’s presence. Additionally, there were instances of “incomplete or incorrect procedures” and “protocol deviations.” To understand these better, it is necessary to investigate the potential reasons behind them. For example:
Identifying and categorizing these interfering events is vital for understanding their impact on the study’s outcomes and ensuring that the diagnostic accuracy is assessed correctly. It also helps in designing future studies to mitigate such events and improve the reliability of diagnostic procedures.
In the realm of diagnostic studies, developing strategies to address interfering events is crucial for maintaining the integrity of the study results. There are several potential strategies to consider, and it is possible to apply different strategies to different interfering events or to use a single strategy across all events. Here is a detailed explanation of how these strategies could be applied to a hypothetical diagnostic study:
Variety of Strategies: There are multiple strategies available for addressing interfering events in diagnostic studies. Some of these strategies may have been derived from established practices, while others might be newly developed for specific scenarios.
Customizing Strategies for Each Event: Each interfering event can be managed with a tailored strategy that fits the particular challenge it presents. This allows for a nuanced approach that can handle the complexity of real-world diagnostic scenarios.
Consistency Across Events: Alternatively, one might choose a consistent strategy for all interfering events to simplify the study design and analysis. However, this could potentially overlook the unique aspects of different types of interference.
Let’s consider the application of these strategies in the context of a study aiming to evaluate a diagnostic test:
Non-Assessable Segment Strategy: In the scenario where a test segment is non-assessable, one might decide to count all such segments as positive test results. This is akin to a composite strategy, which simplifies interpretation by treating all indeterminate results as indicative of the disease.
Omission of Certain Events: For other interfering events, the study might opt not to include them in the analysis. This resembles a principal stratum strategy: records associated with interfering events are excluded, thereby focusing only on the unambiguous data.
Secondary Estimand: To address the limitations of the above approaches, a secondary estimand can be defined that proposes alternative strategies for each interfering event:
Conclusion
In summary, choosing the right strategy or strategies to address interfering events in diagnostic studies is essential for achieving accurate and reliable results. The strategy must align with the study’s objectives and account for the potential impact of each interfering event on the diagnostic accuracy. By considering hypothetical, treatment-dependent, or policy-based strategies, researchers can navigate the complexities of diagnostic testing and produce results that are both informative and applicable to clinical practice.
Support in Planning: The estimand framework provides robust support in the planning phase of diagnostic studies, facilitating a clear definition of study objectives.
Interdisciplinary Exchange: It encourages collaboration between statisticians and clinicians, promoting a comprehensive understanding of the diagnostic process and the factors affecting it.
Event Identification and Weighting: The framework allows for the identification of different interfering events and provides a method for weighting their impact on the study’s accuracy.
Real-world Applicability: By carefully defining the study objective and the estimand, the results of the study can be more easily translated to real-world settings, enhancing their practical utility.
Theoretical Constraints: There is a risk of defining multiple estimands prior to the study, some of which may only be theoretical and not practically informative. This could lead to estimands that are not achievable or relevant in real-world settings.
Increased Planning Effort: The application of this framework requires significant planning and coordination between statisticians and clinicians, which can be resource-intensive.
by Klein, Stefan; Friedrichs, Frauke; Kunz, Michael
The estimand strategy will be presented for an ongoing phase 2a study in atopic dermatitis, comparing an active drug vs. placebo in 72 planned patients, with EASI-75 response as the primary endpoint. We will compare our estimands against the proposals given in Bissonnette et al. (1) for estimands in atopic dermatitis (AD), as well as against the estimand strategy in published phase III studies in AD.
In this particular example, we will discuss some aspects of the estimand concept in proof-of-concept (PoC) studies. Specifically, we will explore scientific objectives in early-phase studies, the impact of different intercurrent-event strategies in PoC studies, the use of the principal stratum strategy, and a useful choice of analysis population in PoC studies. We will also consider the availability of information on the types of intercurrent events that might occur, methods to keep the assignment of patients and data points to analysis sets unbiased, and how the use of the hypothetical strategy might change the choice of endpoints in PoC studies.
Today’s discussion focuses on implementing the estimand framework in the initial stages of clinical drug development, specifically during proof of concept and proof of mechanism studies. Our department bridges the gap between preclinical research and later development stages, which requires us to address different questions than those typically encountered in the latter phases.
One key aspect of early development is validating hypotheses received from preclinical researchers. They often propose that a new compound is effective, and our task is to substantiate or refute this claim. This represents the first scientific question we tackle, which I refer to as “Scientific Question #1.” Our goal is to establish that the drug can produce the expected effect under ideal conditions.
However, that’s not the only inquiry we pursue. Another crucial aspect is determining whether the anticipated effect observed in ideal conditions would also hold in phase three trials or in real-world settings. This constitutes the second scientific question we must address.
In my view, addressing Scientific Question #1 is the primary goal of a proof of concept trial because there are scenarios where extrapolation to later-stage conditions is challenging. This difficulty arises, for example, when relying solely on surrogate endpoints or if the clinical endpoint used in the PoC study may not fully predict outcomes in later phases. Thus, confirming efficacy under ideal conditions is paramount.
Although there is existing literature on estimands in atopic dermatitis, the focus is predominantly on intercurrent events. The key is usually to understand how these events can affect the interpretation of trial results and how to account for them when planning and conducting early-phase trials.
In summary, early development studies using the estimand framework must grapple with two central scientific questions: first, proving the compound’s efficacy in ideal conditions, and second, ensuring that the effects observed are likely to be replicated in broader clinical trials and real-world applications. This dual focus necessitates a thorough planning process that can accommodate the unique challenges of early clinical development.
Treatment and Interventions: The use of topical applications such as emollients or moisturizers is standard in managing AD. These are often fundamental in trials to assess their efficacy or as adjunct therapies to primary treatments.
Differentiation of Medication Use: Stressing the importance of accurately recording which medications were used, when they were used, and distinguishing between disease-specific reasons for their use. This is crucial to avoid introducing bias into the study outcomes.
Study Design: You mention a “double-blind, randomized, placebo-controlled” design, which is the gold standard for clinical trials. This design minimizes bias, ensures the reliability of the results, and allows for the clear determination of the treatment’s efficacy.
Follow-up Duration: The mention of follow-up assessments at 2, 4, 8, and 12 weeks indicates a thorough approach to monitoring the progression of AD and the treatment’s efficacy over time.
Endpoints and Assessment: Utilizing endpoints from existing guidelines, such as the Eczema Area and Severity Index (EASI), facilitates standardization and comparison across studies. The goal of achieving a significant improvement (e.g., EASI-75) is a common endpoint that indicates a substantial benefit to the patient.
Intercurrent Events: Addressing how to handle events such as the use of rescue medication is vital. Your approach to differentiate based on the timing of such interventions (e.g., before or after a certain week) and the adoption of different strategies (e.g., treatment policy strategy versus non-responder imputation) for different timing reflects the complexity of managing and interpreting these events in the analysis.
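The timing-dependent handling described above can be made concrete with a small sketch contrasting a treatment-policy analysis with non-responder imputation (NRI) for a binary responder endpoint such as EASI-75. The patient records and the week-4 cutoff are hypothetical illustrations, not values from the trial:

```python
# Sketch: treatment-policy vs non-responder-imputation (NRI) handling of
# rescue-medication use for a binary responder endpoint (e.g. EASI-75).
# Patients and the cutoff week are invented for illustration.

def responder_rate(patients, cutoff_week):
    """Treatment policy: use the observed response regardless of rescue use.
    NRI: patients rescued before `cutoff_week` count as non-responders
    irrespective of their observed outcome."""
    tp = sum(p["responder"] for p in patients) / len(patients)
    nri = sum(
        p["responder"] and not (p["rescue_week"] is not None
                                and p["rescue_week"] < cutoff_week)
        for p in patients
    ) / len(patients)
    return tp, nri

patients = [
    {"responder": True,  "rescue_week": None},
    {"responder": True,  "rescue_week": 2},   # early rescue -> NRI non-responder
    {"responder": True,  "rescue_week": 8},   # late rescue -> still a responder
    {"responder": False, "rescue_week": None},
]
print(responder_rate(patients, cutoff_week=4))  # (0.75, 0.5)
```

The gap between the two rates shows why the strategy for each intercurrent event must be pre-specified: it directly changes the estimated treatment effect.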
Atopic Dermatitis Overview: You briefly describe AD, highlighting its chronic nature, typical onset in childhood, and standard treatments. This context is essential for understanding the importance of the trial and the impact of the disease on patients.
Primary Estimand and Scientific Questions: Focusing on the primary estimand related to a specific scientific question underlines the trial’s goal to understand the effect of the treatment under ideal conditions. The differentiation in handling intercurrent events like the use of rescue medication underscores the meticulous planning required to interpret the trial’s results accurately.
Transplant and Hypothetical Strategy
Defining the Efficacy Population
Addressing Biases and Ensuring Rigorous Analysis
Supplementary Analyses: Conducting additional analyses to compare the ‘efficacy’ population with the full analysis set (FAS) helps identify any potential biases introduced by the selection criteria for the efficacy population.
Operational Rigor: Emphasizing precise and consistent operational procedures during the trial minimizes the need to exclude patients from the analysis, thereby reducing potential biases.
Trial Design Considerations
Patient Engagement: Design strategies that encourage patient adherence to the treatment protocol, thus minimizing early discontinuations and noncompliance, are crucial. This not only aids in maintaining a robust analysis population but also reflects the trial’s real-world applicability.
Reflections on the Strategy
Simplified Estimand for Phase III Trials: For phase III trials, a more straightforward approach is often utilized, focusing on the treatment policy strategy without stringent considerations for compliance. This approach, incorporating the full analysis set (FAS) population and accommodating rescue medication intake, aims to reflect more closely the real-world application of the intervention. It’s designed to capture the treatment’s effect under conditions that are more representative of routine clinical practice.
Real-World Applicability: By adopting a treatment policy strategy and a broader analysis population, the results from phase III trials are expected to offer insights into the treatment’s effectiveness in a real-world setting. This inclusivity helps stakeholders understand how the drug might perform outside the controlled conditions of a clinical trial.
Planning Challenges: Early-phase trials, with their limited prior data, present unique challenges in planning, especially regarding the timing of treatment effects and the impact of potential intercurrent events (e.g., discontinuations, rescue medication use). This uncertainty necessitates a more cautious and flexible approach to trial design and estimand selection.
Complexity of Implementing Estimands: The use of the estimand framework in early trials requires meticulous planning and clear definitions in the study protocol or statistical analysis plan (SAP). The complexity of adjudicating treatment-related discontinuations or the role of concomitant medications underscores the need for thorough pre-trial planning and possibly the establishment of an adjudication committee.
Adjudication and Bias Avoidance: Ensuring unbiased outcomes in the presence of intercurrent events necessitates blinding adjudicators to efficacy data, a challenging proposition in smaller studies. This highlights the importance of operational integrity and unbiased decision-making in the interpretation of trial results.
Importance of Rigorous Planning
Clarity in Objectives: The estimand framework provides a structured approach to define what the trial aims to estimate, offering clarity on the treatment’s intended effects and how intercurrent events will be addressed.
Data Utilization: By clearly defining the treatment effect of interest and how to handle deviations from the protocol, studies can minimize data wastage, which is especially critical in small-scale trials where every data point is valuable.
Informed Decision-Making: With a clearer understanding of the treatment’s potential effects, stakeholders are better positioned to make informed decisions about the development pathway and the applicability of the treatment in a real-world context.
Reflections on the Addendum and Early Trials
Your observations suggest that while the estimand framework enhances the precision and applicability of clinical trial results, its implementation in early-phase trials, as outlined in regulatory addendums, presents practical challenges. The ideal world condition, a crucial concept within the framework, can be particularly challenging to operationalize in early trials due to the unpredictable nature of new treatments and limited prior data.
Bissonnette R, Eichenfield LF, Simpson E, Thaci D, Kabashima K, Thyssen JP, et al. Estimands for atopic dermatitis clinical trials: Expert opinion on the importance of intercurrent events. J Eur Acad Dermatol Venereol. 2023;37:976–983.
by Holovchak, Anastasiia; Schomaker, Michael
Missing data in multiple variables is a common problem. As part of the CHAPAS-3 trial, we investigated the applicability of a graphical modeling framework for handling missing data to a complex longitudinal pharmacology study of HIV-positive children treated with an efavirenz-based regimen. Specifically, we examine whether the causal effect of interest, defined by a static intervention on multiple continuous variables, can be recovered (consistently estimated) from the available data alone. To date, there is no general algorithm for determining recoverability, and decisions must be made on a case-by-case basis. We emphasize the sensitivity of recoverability to even minimal changes in the graph structure and present recoverability results for three possible missingness DAGs (directed acyclic graphs) for the CHAPAS-3 study, constructed from clinical knowledge. Furthermore, we propose the concept of closed missingness mechanisms and show that under such mechanisms the available case analysis yields consistent estimates of any statistical or causal query, even if the underlying missingness mechanism is of the MNAR (missing not at random) type. A simulation study demonstrates how the estimation results change depending on the assumed missingness DAG. Our analysis may be the first to demonstrate the applicability of missingness DAGs to complex longitudinal real-world data, while highlighting the sensitivity of the results to the hypothesized causal model.
Main DAG and Alternative DAG: These diagrams represent different assumptions about the relationship between variables, including the missingness mechanism. The main DAG reflects the initial assumption, while the alternative DAG incorporates additional considerations, such as the impact of health status on missing visits.
Causal Concentration-Response Curve: The objective is to estimate the counterfactual probability of viral failure over time based on different trajectories of drug concentration. This involves intervening on the drug concentration to see how it would influence viral load, assuming constant exposure throughout the study.
Identifiability: This concept relates to whether a causal effect can be determined from the observed data distribution. In the absence of missing data, standard causal inference methods, like the g-formula or the back-door criterion, can be used to identify the causal effect.
Recoverability: This extends the concept of identifiability to situations with missing data. It assesses whether the causal effect can be estimated from the available (incomplete) data. A key prerequisite for recoverability is the identifiability of the effect in a complete data scenario.
Estimating Conditional Distributions: When data are missing, techniques such as imputation, inverse probability weighting, or other methods might be used to estimate the required conditional distributions. This process relies heavily on the assumptions made about the missingness mechanism and the relationships between variables as depicted in the DAGs.
Consistency Assumption: This assumption states that the counterfactual outcome for an individual, had they received the treatment, coincides with the observed outcome if they indeed received the treatment. This is fundamental for linking the counterfactual model with observable data.
Sensitivity to Missingness Mechanisms: Your study aims to explore how sensitive the results are to changes in assumptions about the missingness mechanism. This is crucial for understanding the robustness of your findings and their applicability to real-world settings.
Complexity of Estimation: The complexity arises from the need to estimate several distributions, each potentially affected by missing data. This necessitates a careful consideration of the underlying assumptions and the selection of appropriate statistical methods.
Comparison of Analytical Approaches: The comparison between available case analysis and multiple imputation highlights a fundamental challenge in causal inference studies dealing with missing data. You found that while available case analysis yielded results consistent with the true effect under the main DAG assumption, multiple imputation did not, owing to its incompatibility with the MNAR assumption.
Theoretical vs. Practical Recoverability: The distinction between theoretical recoverability (the potential to estimate the causal effect accurately with perfect knowledge and modeling of the missingness mechanism) and practical recoverability (the real-world feasibility of such estimation) is crucial. Your findings suggest that even small discrepancies in the missingness DAG can lead to significant differences in recoverability and, by extension, the accuracy of the estimated causal effect.
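The query-dependence of recoverability can be illustrated with a toy simulation, entirely separate from the CHAPAS-3 analysis: when missingness of the outcome depends only on a fully observed covariate, the available-case regression slope remains recoverable while the available-case marginal mean does not. The data-generating model below is invented for illustration only:

```python
# Sketch: recoverability depends on the query and the assumed missingness
# graph. Toy model (invented): y = 2x + noise, y missing with probability
# depending only on the fully observed covariate x.
import random

random.seed(0)
n = 200_000
data = []
for _ in range(n):
    x = random.gauss(0, 1)
    y = 2 * x + random.gauss(0, 1)
    observed = random.random() < (0.9 if x > 0 else 0.3)
    data.append((x, y, observed))

obs = [(x, y) for x, y, o in data if o]
# Available-case marginal mean of y: biased for E[Y] = 0, because complete
# cases over-represent large x.
mean_y_cc = sum(y for _, y in obs) / len(obs)
# Available-case regression slope of y on x: still close to the true 2,
# because the conditional distribution of y given x is unaffected.
mx = sum(x for x, _ in obs) / len(obs)
slope = (sum((x - mx) * (y - mean_y_cc) for x, y in obs)
         / sum((x - mx) ** 2 for x, _ in obs))
print(round(mean_y_cc, 2), round(slope, 2))
```

Which queries survive available-case analysis is exactly what a missingness DAG is meant to encode; a small change to the graph (e.g., letting missingness depend on y itself) would break the slope as well.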
Randomized trials are considered the gold standard for minimizing confounding in treatment effect estimates. However, data from randomized treatment groups are not always available, e.g. when comparing pooled cohorts from different studies, and other methods are needed to control for confounding. In addition to multiple regression models that include confounding variables, propensity score (PS) methods were developed to balance baseline characteristics between treatment groups. The purpose of this case study is to compare adjustment for confounding using multiple regression models alone and combined with optimal matching and inverse probability of treatment weighting (IPTW), with or without multiple imputation (MI), in a setting where the percentage of missing values in some confounding variables is high. In particular, the focus is on the applicability of the methods and the differences in their results.
To compare statistical methods for controlling for these confounders, we used two pooled cohorts from two different settings (clinical trials versus usual care) receiving specific treatments for mantle cell lymphoma (MCL). The aim was to compare the clinical outcomes of these two treatment regimens. Multiple Cox regression models were adjusted for the following relevant prognostic factors: MIPI score (clinical prognostic score), Ki67 (cell proliferation marker), and cytology (alone or in combination). Additionally, 1:1 optimal PS matching and IPTW were applied. The PS was calculated using multiple logistic regression including sex and MIPI score. Due to the high proportion of missing values in Ki67 and cytology, the PS methods were combined with MI using multivariate imputation by chained equations (MICE). After imputation of the missing values in Ki67 and cytology, these variables could be added to the logistic regression model for the PS calculation.
The balance of the relevant prognostic factors included in the logistic regression model for the PS calculation was slightly better after IPTW than after optimal matching, especially for Ki67 and cytology when combined with multiple imputation. However, the balance of the individual MIPI variables was still insufficient under both PS methods, contrary to the purpose of their application. All analyses consistently showed no significant differences between the clinical outcomes of the two treatment regimens, and the hazard ratios and their 95% confidence intervals differed only minimally between the different confounding-control analyses. In summary, this case study demonstrates that PS methods may not always be a suitable alternative to randomization.
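To make the IPTW step concrete, the following minimal sketch computes ATE weights from fitted propensity scores. The scores and treatment indicators are made-up toy values; in the case study the PS came from a multiple logistic regression on sex and MIPI score:

```python
# Sketch: inverse probability of treatment weighting (IPTW) from given
# propensity scores. Toy inputs; no fitted model from the case study.

def iptw_weights(treated, ps, stabilized=True):
    """ATE weights: 1/ps for treated, 1/(1-ps) for controls, optionally
    stabilized by the marginal treatment probability (reduces variance)."""
    p_treat = sum(treated) / len(treated)
    weights = []
    for t, p in zip(treated, ps):
        base = 1 / p if t else 1 / (1 - p)
        weights.append(base * (p_treat if t else 1 - p_treat)
                       if stabilized else base)
    return weights

treated = [1, 1, 0, 0, 1, 0]          # hypothetical treatment indicators
ps = [0.8, 0.6, 0.4, 0.2, 0.5, 0.3]   # hypothetical fitted propensity scores
print([round(w, 2) for w in iptw_weights(treated, ps)])
```

After weighting, covariate balance is checked (e.g., via standardized mean differences), which is the step where the MIPI imbalance reported above was detected.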
by Großhennig, Anika1; Koch, Armin1; Beutel, Gernot2; Framke, Theodor1
1Institute of Biostatistics, Hannover Medical School, Germany; 2Department of Hematology, Hemostasis, Oncology and Stem Cell Transplantation, Hannover Medical School, Germany
Since the first half of the 20th century, randomization, replication, and blocking have been the three established principles in the design of experiments. They were advocated by Ronald Fisher, who also made statistical analysis methods accessible to a broader scientific audience in various editions of his book Statistical Methods for Research Workers, e.g. (1). Although reminders of the need for randomization have been published repeatedly, e.g. (2, 3), there are still many non-randomized trials. Particularly in rare diseases, where the available sample size for a clinical trial is restricted, control groups are often omitted with this argument. To illustrate the importance of randomization, we discuss a case study in the field of stem cell transplantation (4). Different designs for one and the same research question, together with hypothetical and real results, are reported and discussed. We argue that randomization techniques should be implemented routinely whenever applicable, even if the available number of patients is small. This discussion is especially relevant in the context of the current draft EMA Reflection Paper on single-arm trials (5).
Clinical Trial Background: The text begins with a reflection on the outcomes of a clinical trial involving a treatment referred to as “Clara C”. Initially, there was optimism because the two-year overall survival rate was 10% better than what was expected based on planning assumptions from the literature. However, the improvement, while better than expected, was not considered significantly impactful. This led to a nuanced evaluation of the treatment’s effectiveness and the difficult decision-making process regarding the future of the project after eight years of work.
Randomized Controlled Trial (RCT) Design: The significance of conducting an RCT is highlighted. RCTs are considered the gold standard for evaluating the efficacy of medical interventions because they minimize bias, allowing for a more accurate comparison between the treatment under investigation and the standard of care. In this scenario, an RCT was conducted with a control group receiving standard care and another group receiving the Clara C treatment.
Trial Outcomes and Interpretation: The results of the RCT showed that the standard of care performed better than the Clara C treatment, not only failing to demonstrate the superiority of Clara C but also indicating that the standard of care was more effective, although not significantly. This outcome was unexpected and raised questions about the assumptions made during the planning phase of the trial.
Comparison to Pembrolizumab Case: The text then draws a parallel to the case of pembrolizumab, a drug known for its effectiveness in several indications but which failed to meet post-marketing requirements in a specific indication during a randomized control trial. This example underscores the importance of RCTs in validating the efficacy of treatments, even those already approved based on non-comparative trials.
Challenges in Clinical Research: The discussion emphasizes the ongoing challenge in convincing stakeholders of the importance of RCTs. Despite their critical role in generating reliable data, obstacles such as convincing physicians, managing small population sizes in trials, and obtaining unbiased estimates of treatment effects persist.
Implications for Future Research and Regulatory Guidance: The narrative concludes with a call to recognize the importance of RCTs for obtaining unbiased and interpretable results, especially in small population studies. It suggests that relying solely on non-randomized trials or single-arm studies may not provide a comprehensive understanding of a treatment’s efficacy, highlighting the necessity of RCTs for conclusive evidence.
Trial Design Comparison: The conversation starts by comparing different research strategies, specifically the use of single-arm trials versus initiating studies with RCTs. The speaker highlights the challenges in interpreting results from single-arm studies due to their inherent lack of a control group, which makes validating the outcomes against planning assumptions difficult.
Thought Experiment on Trial Designs: A thought experiment is presented where if research on Clara C had started with a single-arm study, the process to reach a conclusive result about the treatment’s effectiveness would be lengthy and require a subsequent trial to validate promising results. This approach is contrasted with starting directly with an RCT, which, despite being underpowered, provides an unbiased estimate of the treatment effect and allows for a benefit-risk assessment.
Efficiency and Efficacy of RCTs: The speaker argues that starting with an RCT, even with a limited number of participants, is more time and resource-efficient than the sequential approach of a single-arm study followed by an RCT. This efficiency comes from the direct comparison allowed in RCTs, providing crucial data on the treatment’s efficacy and safety in a shorter timeframe.
Randomization and Its Importance: Emphasis is placed on the value of randomization in clinical trials. Randomization is critical for minimizing bias and ensuring that the treatment effects observed are not due to underlying differences between the groups. The speaker acknowledges exceptions where dramatic effects are expected from a treatment, but such cases are rare, reinforcing the general preference for RCTs.
Benefit-Risk Assessment: The ability to perform a comprehensive benefit-risk assessment is highlighted as a key advantage of RCTs. Such assessments are crucial for patient safety and are more robust when based on data from RCTs compared to single-arm studies.
by Tasto, Christoph; Bayer AG, Germany
Clinical trials in chronic kidney disease (CKD) often utilize composite endpoints comprising clinical events such as onset of end-stage kidney disease (ESKD) and initiation of kidney function replacement therapy (KFRT), along with a sustained large (e.g., ≥50%) decrease in glomerular filtration rate (GFR). Such events typically occur late in the disease course, resulting in large and long trials in which most participants do not contribute clinical events. More recently, the rate of GFR decline over time (i.e., GFR slope) has been suggested as a more efficient endpoint, and the EMA published a Draft Qualification opinion for GFR slope as a Surrogate Endpoint in CKD trials. This endpoint is considered particularly useful in early CKD stages as well as patient populations with slower CKD progression.
We introduce the use of hierarchical composite endpoints (HCEs) in clinical trials of CKD progression, emphasizing the potential to combine clinical events with the continuous GFR slope, while ranking all components according to clinical importance. Post-hoc analyses of several large CKD trials illustrate the application of the newly developed kidney HCE including bootstrap-based efficiency comparisons with the established endpoints.
The prioritization of clinical outcomes and ability to combine clinical outcomes with GFR slope make the HCE an attractive alternative endpoint that holistically captures CKD progression.
Kidney as a Filter: The kidney’s primary function is to filter waste products and excess substances from the blood, which are then excreted in the urine. This process helps maintain a stable balance of body chemicals. The basic filtering units of the kidney are the nephrons, and a healthy human kidney contains about 1 million nephrons.
Glomerular Filtration Rate (GFR): GFR is a critical measure of kidney function. It indicates how much blood is filtered by the glomeruli (the filtering units within the nephrons) each minute. A healthy GFR varies according to age, sex, and body size, but a GFR under 60 mL/min/1.73 m² for three months or more can indicate chronic kidney disease.
Effect of Kidney Disease: In CKD, the kidneys’ ability to filter blood effectively deteriorates over time. This decrease in kidney function leads to an accumulation of waste products in the blood, which can cause various health issues. Early stages of CKD often have no symptoms, making it difficult to detect without specific tests.
Challenges in CKD Trials: Traditional endpoints like time to dialysis or kidney transplantation are not feasible for most CKD trials due to their rarity and the long duration required to observe such outcomes. This has led researchers to look for surrogate endpoints that can predict these major outcomes more quickly and efficiently.
GFR Decline as a Surrogate Endpoint: A significant decline in GFR from baseline is used as a surrogate endpoint in CKD trials. It aims to predict the risk of reaching end-stage kidney disease (ESKD), requiring dialysis or transplantation. The use of a 30%, 40%, or 57% decline in GFR as a composite endpoint along with ESKD or related death allows for a more nuanced assessment of kidney function over time.
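A sketch of how such a surrogate component might be flagged from a patient's GFR series follows. The "sustained over two consecutive visits" rule used here is an illustrative simplification, not the exact definition from any trial above:

```python
# Sketch: flagging a sustained >= 40% GFR decline from baseline, one common
# surrogate component. Rule and data are illustrative simplifications.

def sustained_decline(gfr_series, threshold=0.40):
    """gfr_series: GFR values in visit order; first value is baseline.
    Returns True if the relative decline from baseline reaches `threshold`
    at two consecutive post-baseline visits."""
    baseline = gfr_series[0]
    hits = [(baseline - g) / baseline >= threshold for g in gfr_series[1:]]
    return any(a and b for a, b in zip(hits, hits[1:]))

print(sustained_decline([80, 60, 45, 44]))  # True: 45 and 44 are >40% below 80
print(sustained_decline([80, 45, 70, 68]))  # False: the dip is not sustained
```

Requiring the decline to be sustained guards against acute, reversible GFR dips being counted as disease progression.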
Hierarchical Composite Endpoints (HCEs): HCEs are a novel approach in CKD trials that combine continuous measures of GFR decline (the slope of GFR over time) with time-to-event outcomes (like reaching ESKD). This method allows for a comprehensive evaluation of treatment effects, considering both the rate of progression and the occurrence of significant clinical events.
Patient-wise Comparison in HCEs: In trials using HCEs, each participant in the treatment group is compared to each participant in the control group on a set of predefined outcomes. This comparison helps determine whether the treatment group experiences a slower progression of CKD or fewer adverse outcomes compared to the control group, providing a nuanced view of the treatment’s efficacy.
Hierarchy of Outcomes: HCEs organize outcomes based on their severity and relevance to the disease progression. In the context of CKD trials, this hierarchy might include, from most to least severe:
Combining Event and Continuous Endpoints: This approach integrates both time-to-event outcomes (such as cardiovascular death or initiation of dialysis) and continuous outcomes (like the slope of GFR decline over time). If none of the specified time-to-event outcomes occur within a predetermined timeframe (e.g., three years), the total slope of GFR decline is used for comparison.
Patient-Level Comparison: The comparison involves assessing each patient in the treatment group against each patient in the control group based on the predefined hierarchy of outcomes. This method helps determine the treatment’s efficacy in slowing disease progression or preventing severe outcomes.
Calculation of Odds: The method you described calculates the odds of a patient in the treatment group having a better outcome than a patient in the control group. This is done by considering wins (where the treatment group fares better), losses (where the control group fares better), and ties (where outcomes are similar).
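The pairwise comparison and win-odds calculation described above can be sketched for a two-level hierarchy (clinical event first, then GFR slope). All patient data below are hypothetical, and ties are split equally between the numerator and denominator, a common convention:

```python
# Sketch: win odds for a two-level hierarchical composite endpoint.
# Each patient: (event_time or None, gfr_slope). Hypothetical data.

def compare(a, b):
    """+1 if patient a fares better, -1 if worse, 0 if tied.
    Event level first: avoiding the event beats having it; among events,
    a later event is better. If neither had an event, compare GFR slopes
    (a shallower, i.e. less negative, decline is better)."""
    (ea, sa), (eb, sb) = a, b
    if ea is not None or eb is not None:
        if ea is None:
            return 1
        if eb is None:
            return -1
        return (ea > eb) - (ea < eb)
    return (sa > sb) - (sa < sb)

def win_odds(treat, control):
    wins = losses = ties = 0
    for t in treat:                 # every treated patient vs
        for c in control:           # every control patient
            r = compare(t, c)
            if r > 0:
                wins += 1
            elif r < 0:
                losses += 1
            else:
                ties += 1
    return (wins + 0.5 * ties) / (losses + 0.5 * ties)

treat = [(None, -2.0), (None, -3.5), (24, -6.0)]
control = [(None, -4.0), (12, -7.0), (30, -5.0)]
print(win_odds(treat, control))  # 3.5 (7 wins, 2 losses, 0 ties)
```

A win odds above 1 favors the treatment group; the share of comparisons decided at each level also shows which component drives the result, as discussed for the finerenone trial below.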
Trial Example: In the finerenone trial for non-diabetic kidney disease, over 5000 patients were randomized to receive either finerenone or a placebo. The primary endpoint analysis showed a hazard ratio, and a slope analysis was conducted to assess the difference in GFR decline over three years.
Contribution of Components to HCE: In this trial, various components contributed differently to the overall analysis. For a significant portion of patients, the slope of GFR decline was the primary factor considered, indicating that for these patients, no severe time-to-event outcome occurred during the trial.
Efficiency and Power Analysis: By performing efficiency comparisons and using bootstrap resampling, the trial demonstrated how HCEs could provide powerful insights into treatment efficacy with varying sample sizes. This approach allows for a nuanced understanding of how treatments can impact the progression of CKD across a spectrum of outcomes.
Implications for Clinical Trials: The use of HCEs in CKD trials offers a more holistic view of treatment efficacy, incorporating both the prevention of severe outcomes and the modification of disease progression as measured by GFR decline. This method acknowledges the multifaceted nature of CKD and the need for treatments to address both acute and chronic aspects of the disease.
Enhanced Efficacy Analysis: HCEs allow for a prioritized analysis of outcomes, focusing on the most severe events first. This prioritization ensures that the most critical aspects of disease progression are considered foremost in the evaluation of treatment efficacy.
Integration of Diverse Data: By combining clinical event data with the slope of GFR decline, HCEs provide a comprehensive view of treatment effects. This integration is particularly important in diseases like CKD, where progression can vary widely among patients.
Alignment with Traditional Endpoints: The results from HCE analyses are well aligned with traditional endpoints in terms of effect direction and magnitude, reinforcing their validity and utility in clinical research.
Potential for Efficiency Gains: The analysis demonstrates that HCEs could achieve high statistical power with fewer patients than traditional endpoints, indicating significant efficiency gains. For example, achieving 90% power with only 1100 patients in the context of the Fidelio trial highlights the potential of HCEs to streamline clinical research.
Variability Across Trials: The efficiency and efficacy of HCEs compared to traditional endpoints and GFR slope analysis vary across trials. This variability underscores the importance of carefully considering the expected effects on different components of the HCE before its implementation.
Implementation Resources: The availability of R implementations and synthetic datasets for HCE analysis facilitates their adoption in clinical research. These resources, along with detailed publications on design considerations and the HCE framework, provide valuable guidance for researchers.
Regulatory Endorsement: The qualification opinion granted by the EMA for the use of HCEs and the slope endpoint in CKD trials underscores the regulatory acceptance and support for innovative analytical methods in drug development.
by Franco Castiblanco, Ana Carolina; Brannath, Werner, University of Bremen, Germany
The standard sample size formulae for cluster randomized trials assume that the within-cluster variability is homogeneous among clusters. In practice, however, the within-cluster variability may not be constant, and the standard formulae may be biased. We propose a general sample size formula for cluster randomized trials with heterogeneous within-cluster variability, for both constant and variable cluster-wise sample sizes, together with its simplification for two- and three-level trials. In addition, we propose estimators for the variance components based on conditional means. Furthermore, we conduct a simulation study to investigate the behavior of the proposed sample size formula and variance component estimators, and we compare them with the standard sample size formulae and with variance component estimation via multilevel linear models.
Multilevel trials are studies where the data is organized at more than one level, typically involving groups or clusters. In your context, clusters might be healthcare providers, within which patients receive treatment. These trials recognize the natural grouping of data in real-world settings and account for the potential correlation of outcomes within these groups.
A key point in your research is addressing the assumption of homogeneity in within-cluster variances. Traditional multilevel models often assume that all individuals within a cluster, and all clusters themselves, have similar variability in outcomes. Your work challenges this by proposing models that allow for heterogeneous variances, acknowledging that individuals and clusters may exhibit different levels of variability due to various factors like differing treatment responses or operational practices among healthcare providers.
You’ve introduced a mixed model with two fixed effects (baseline and treatment) and random components accounting for between-cluster and within-cluster variances. By allowing for random variation in the number of subjects per cluster and in the within-cluster variances, your model offers a more nuanced understanding of the data structure in multilevel trials. This approach can lead to more accurate and realistic estimates of treatment effects and the necessary sample sizes for detecting these effects.
The ICC is a crucial statistic in multilevel analysis, measuring the degree of correlation of outcomes within clusters. By incorporating heterogeneous variances into the ICC calculation, you provide a more accurate measure of the similarity of responses within clusters, which is essential for designing trials with sufficient power.
Your approach to hypothesis testing in this context involves defining a parameter of interest (the difference between intervention means) and approximating its variance. This approximation incorporates the heterogeneous variances and the distribution of cluster sizes, leading to a more accurate estimation of the sample size needed to detect a given effect size.
The design effect or variance inflation factor accounts for the clustering in the data, indicating how much larger the sample size must be compared to a simple random sample to achieve the same level of precision. Your method updates this calculation to reflect the heterogeneity in variances, providing a more tailored approach to determining sample size in multilevel trials.
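As a rough sketch of the quantities involved, the code below combines a standard design-effect formula for variable cluster sizes, DE = 1 + ((cv² + 1)·m̄ − 1)·ρ, with an ICC that plugs in the mean within-cluster variance when the within-cluster variances differ. The talk's proposed formula is more general than this stand-in; all numbers here are purely illustrative.

```python
import math

def icc(sigma_b2, sigma_w2_list):
    """ICC using the average within-cluster variance as a simple
    stand-in when within-cluster variances are heterogeneous."""
    sw2 = sum(sigma_w2_list) / len(sigma_w2_list)
    return sigma_b2 / (sigma_b2 + sw2)

def design_effect(m_bar, cv, rho):
    """Design effect for variable cluster sizes:
    DE = 1 + ((cv^2 + 1) * m_bar - 1) * rho."""
    return 1 + ((cv ** 2 + 1) * m_bar - 1) * rho

# example: mean cluster size 20, heterogeneous within-cluster variances
rho = icc(0.05, [0.90, 1.00, 1.10])       # -> 0.05 / 1.05
de_equal = design_effect(20, 0.0, rho)    # equal cluster sizes
de_var = design_effect(20, 0.6, rho)      # cv of cluster sizes = 0.6
n_srs = 400                               # n under simple random sampling
n_crt = math.ceil(n_srs * de_var)         # inflated sample size
```

The comparison of `de_equal` and `de_var` shows how ignoring cluster-size variability understates the required sample size.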
Multilevel Trial Design: You’ve highlighted the importance of considering the hierarchical structure of data in trial design, where interventions might be applied at one level (e.g., healthcare providers) but outcomes measured at another (e.g., patients).
Heterogeneity in Variances: By introducing random effects models that account for heterogeneity in variances both within and between clusters, your methodology offers a more realistic representation of data variability. This approach acknowledges the diversity of responses within clusters and the differences in operational practices between clusters.
Intraclass Correlation Coefficients (ICCs): Calculating ICCs that reflect the true variability within and between clusters enhances the accuracy of sample size estimates and the power of the study. Your work on updating ICC calculations to include heterogeneous variances is a significant contribution to statistical methodologies for multilevel trials.
Design Effects and Sample Size Calculation: Your method for calculating design effects and, subsequently, required sample sizes by incorporating heterogeneity and variability across levels could lead to more efficient study designs. This is crucial for ensuring that studies are neither underpowered (risking type II errors) nor unnecessarily large (wasting resources).
Challenges with Estimation: You’ve acknowledged the difficulties in estimating parameters within this more complex model, especially with limited pilot data. The exploration of Bayesian estimation methods and interim analyses for sample size re-estimation reflects a thoughtful approach to overcoming these challenges.
Implications and Future Directions
Your work has important implications for the design and analysis of multilevel trials:
by Glimm, Ekkehard; Yau, Lillian, Novartis Pharma, Switzerland
To compare the effectiveness of different medical treatments in observational studies, or across different clinical studies, it is necessary to eliminate the influence of confounding factors if these are differently distributed in the treatment groups. A popular method for confounder adjustment is inverse probability weighting using propensity scores estimated from logistic regression as weights. While this method achieves “roughly matched” groups, determining when the matching is deemed “close enough” often sparks extensive debates.
In this talk, we propose a novel approach to the matching problem by reframing it as a constrained optimization problem. We explore the conditions under which a perfect match can be achieved, in the sense that the average value of confounders becomes identical in the treatment groups post-matching. We discuss the utilization of different objective functions, such as a function maximizing the effective sample size (ESS), to identify a specific set of weights that satisfy the given constraints. Depending on the chosen objective function, targeted optimizers like LPSOLVE or quadprog in R can be employed to efficiently determine these matching weights.
Our approach is closely related to the matching-adjusted indirect comparison approach of Signorovitch et al. (2010). However, we go beyond their suggestion by not insisting on a specific functional form for the matching weights. In addition, the suggested approach can be applied to individual patient data from all treatment groups, as well as in situations where only aggregated data is available for some groups.
In the talk, we will introduce the basic idea, apply the proposed approach to a dataset from an observational study, and compare the results with those obtained from propensity score matching.
PSM is a statistical technique widely used to control for confounding in observational studies, making it possible to estimate the causal effect of a treatment. By matching units (e.g., patients) from different treatment groups based on their propensity scores—their probability of receiving each treatment, given their covariates—researchers can create a “balanced” dataset that mimics some properties of a randomized controlled trial.
MAIC is particularly useful when comparing treatments across studies, especially when individual patient data (IPD) is available for one study but only aggregate data for the comparator. This technique adjusts the IPD to make the study populations comparable based on observed characteristics.
By framing these methods as a constrained optimization problem, you’re suggesting a novel approach that seeks to identify the best set of weights for each study participant, subject to constraints that ensure the weights lead to comparable groups across studies. This framework allows for a systematic and potentially more efficient way to achieve balance across studies, enhancing the comparability of treatment effects.
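A minimal version of this idea: among all weights that sum to one and reproduce the target confounder means exactly, pick the set minimizing Σw², which is equivalent to maximizing the effective sample size ESS = (Σw)²/Σw². Without non-negativity constraints this has a closed form via the pseudoinverse; adding w ≥ 0, as one typically would in practice, requires a QP solver such as quadprog. The patient data and target means below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical individual patient data: two confounders per patient
X = rng.normal(size=(50, 2))
# aggregate target means from the comparator study (illustrative)
target = np.array([0.3, -0.1])

# stack the constraints A w = b: weights sum to 1, weighted means hit target
A = np.vstack([np.ones(50), X.T])
b = np.concatenate([[1.0], target])

# minimum-norm (max-ESS) weights: among all w with A w = b, the
# pseudoinverse selects the one minimizing sum(w^2), i.e. maximizing
# ESS = (sum w)^2 / sum(w^2) = 1 / sum(w^2) here
w = np.linalg.pinv(A) @ b

ess = w.sum() ** 2 / (w ** 2).sum()
```

After weighting, the average confounder values are identical to the target in both groups, i.e. a "perfect match" in the sense described above.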
Your unified framework offers a promising avenue for advancing comparative effectiveness research, particularly when direct comparisons via randomized trials are not feasible. This approach could lead to more nuanced and accurate estimates of treatment effects across different populations and study designs. Future research might explore:
The summary effectively highlights the potential benefits of the OS method for matching in comparative studies, but also points out that the interpretation of the resulting weights may not be straightforward. The probabilistic understanding of such weights is crucial for ensuring that the conclusions drawn from the comparative analyses are not only statistically sound but also meaningful in a real-world context. Further research may be needed to better understand and articulate the implications of OS weights in terms of probabilities, especially in comparison to more traditional propensity score methods, which have a well-established probabilistic interpretation.
Dantzig GB (1947). Maximization of a linear function of variables subject to linear inequalities. In Koopmans TC (ed.), Activity Analysis of Production and Allocation. New York: Wiley; London: Chapman & Hall, 1951; pp. 339-347.
Glimm E and Yau L (2022). Geometric approaches to assessing the numerical feasibility for conducting matching-adjusted indirect comparisons. Pharmaceutical Statistics; DOI: 10.1002/pst.2210.
Signorovitch JE, Wu EQ, Yu AP, et al. (2010). Comparative effectiveness without head- to-head Trials: A method for matching-adjusted indirect comparisons applied to psoriasis treatment with adalimumab or etanercept. Pharmacoeconomics; 28(10) 935-945.
Yau L and Glimm E (2022). maicChecks: Assessing numerical feasibility for conducting MAIC. R package version 0.1.2. https://CRAN.R-project.org/package=maicChecks
by Bretz, Frank, Novartis, Switzerland
Since the release of the ICH E9(R1) document in 2019, the estimand framework has become a fundamental part of clinical trial protocols. In parallel, complex innovative designs have gained increased popularity in drug development. It is currently unclear, however, to what degree the estimand framework applies to these novel designs. For example, should a different estimand be specified for each sub-population (defined, for example, by cancer site) in a basket trial? Or can a single estimand focusing on the general population (defined, for example, by the positivity to a certain biomarker) be used? In the case of platform trials, should a different estimand be proposed for each drug investigated? We discuss relevant estimand considerations pertaining to different types of complex innovative designs. We consider trials that allow adding or selecting experimental treatment arms, modifying the control arm, and selecting or pooling populations. We also address the potentially data-driven, adaptive selection of estimands in an ongoing trial and disentangle certain statistical issues that pertain to estimation rather than to estimands, such as the borrowing of non-concurrent information.
ICH E9 Addendum: This guideline, which came out about five years ago, introduced the concept of an ‘estimand’ as a comprehensive description of the treatment effect that aims to answer the trial’s primary question. It’s about summarizing the expected outcomes at the population level if patients were to undergo different treatment conditions.
Relevance to CID: Although there is extensive literature on estimands, there seems to be less focus on how they apply within the realm of CIDs. Your presentation posits that the principles and methodologies of defining estimands remain relevant regardless of the trial design’s complexity.
Definition and Scope: CIDs refer to trial designs that may include novel aspects not previously used to establish efficacy in new drug applications. This novelty could be in terms of the design itself or the application of known design features to new indications.
FDA Guidance: The FDA has provided guidance on CIDs, indicating the agency’s openness to novel methodologies in confirmatory trials, so long as these new approaches can reliably demonstrate the effectiveness of an intervention.
Key Message: One of the take-home messages is that the thought process and principles behind the ICH E9 addendum are indeed applicable to CIDs. This assertion may seem straightforward, but there has been some confusion within the field regarding the extent to which the estimand framework can be integrated with CIDs.
Applicability: Your work aims to clarify that estimands should be a fundamental part of CIDs, ensuring that even in the face of complexity, the treatment effects are quantified in a manner that is consistent with regulatory expectations.
The process of defining estimands is crucial and should precede the selection of analytical methods in complex, innovative trial designs like basket trials. The case study illustrates the necessity of considering different attributes of estimands and ensuring that the analysis approach aligns with the defined estimands. This structured approach helps ensure that the conclusions drawn from the study are robust and reflective of the trial’s objectives, despite the complexities introduced by the innovative design.
Attributes of Estimands
Changes in Standard of Care: In long-lasting trials, such as master protocol designs, it’s possible that the standard of care, which serves as the control arm, might change. This scenario does necessitate a change in the estimand since the comparator is no longer the same.
Distinct Estimands for Different Stages: If the control arm changes during the trial, you effectively have two distinct sets of estimands — one for each stage of the trial. The treatment attribute of the estimand would need to reflect this change.
Dedicated Estimand Discussions: For each experimental treatment compared with the initial control arm, a dedicated discussion is needed, leading to distinct estimands for each treatment-control comparison.
Stage-Specific Estimands: With the change of the control arm during the trial, you argue for considering separate estimands for each stage rather than combining them, as combining could lead to interpretational difficulties.
Changes in the treatment landscape can impact both the patient population and the occurrence of intercurrent events over the course of the study. Here’s a detailed breakdown of the implications for estimands in such scenarios:
The case study is a Phase III basket trial designed to assess the safety and efficacy of a marketed product for new indications across three different types of allergies.
Multiple Sub-Studies: The trial includes different sub-studies for each type of allergy, each requiring its own estimand and specific discussion, due to the unique characteristics of each allergy type.
Protocol Complexity: One of the main challenges is incorporating all necessary details into a single protocol without overwhelming the document, while also ensuring that the protocol includes all required elements for each estimand.
Operational Efficiency: Conducting a single trial with multiple sub-studies offers logistical simplicity and potential for improved data quality, as patients are treated concurrently at the same centers.
Data Quality: The standardization of trial procedures across different sub-studies can lead to better comparability and quality of data.
Consistent Objectives: Despite the innovative design, the objectives for the basket trial are the same as they would be for three separate trials.
Distinct Discussions: The team recognized the need for separate and distinct discussions for each of the three sub-studies, leading to different estimands for each type of allergy.
Protocol Development: A significant challenge was how to integrate the details of the three sub-studies into one master protocol without making it overly complex.
Estimand Attributes: Although the investigational product and comparator were consistent across sub-studies, the outcome variables differed due to the unique characteristics and measurement methods for each allergy type.
List of Intercurrent Events: The intercurrent events considered were treatment-related rather than population-related, implying they were consistent across all sub-studies.
Common Events: These included discontinuation of the investigational product, use of prohibited medications, and lack of efficacy.
Population Differences: Although all sub-studies fall under the same protocol, the patient populations differ, as each sub-study targets a different type of allergy with its own inclusion and exclusion criteria.
Outcome Variables: Each type of allergy has specific ways of being measured, leading to different outcome variables for each sub-study.
Intercurrent Events: Treatment-related intercurrent events are standardized across the sub-studies, as they are not population-specific. These could include discontinuation of the investigational product, prohibited medication use, or lack of efficacy.
Trial Participation Encouragement: Even if no further treatment was offered, participants were encouraged to stay in the trial, affecting the imputation strategy for the data analysis.
Different Outcome Variables: Due to the nature of the different allergies, each sub-study needed to measure disease-specific outcomes, necessitating different variables for each estimand.
Data Analysis and Imputation: The data analysis plan, including the approach to missing data imputation, had to be carefully considered to handle the potential differences in intercurrent events and to maintain the integrity of each sub-study’s estimands.
Collignon, Schiel, Burman, Rufibach, Posch, Bretz (2022). Estimands and Complex Innovative Designs. Clinical Pharmacology & Therapeutics 112(6), 1183-1190.
by Koenig, Franz; Posch, Martin; Zehetmayer, Sonja, Medical University of Vienna, Austria
Platform trials have been proposed in which several randomized clinical trials with related objectives are combined into a single trial with a joint master protocol, improving efficiency by reducing costs and saving time. Treatment arms can enter and leave the study at different times during its conduct, possibly depending on previous results or available resources, and the total number of treatment arms in a platform trial is not fixed in advance. One big advantage of platform trials is the sharing of one or several control arms.
As many hypotheses will eventually be tested, we will discuss whether an adjustment for multiplicity is indeed needed in the context of platform trials. In addition to the two extreme positions of no adjustment at all or requiring strict control of the familywise error rate, control of other error rates such as the False Discovery Rate (FDR) will be scrutinized. Particular attention is required because the total number of hypotheses being tested is usually unknown in the planning phase, and interim analyses might complicate matters further. Another issue is how the information from already collected data might impact the planning of treatments to be added.
We compare the impact on required sample sizes and power, depending on which error rate is controlled, e.g., the experiment-wise error rate or the FDR.
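The contrast between the two positions can be made concrete: Bonferroni controls the familywise error rate by testing each of m hypotheses at α/m, while the Benjamini-Hochberg step-up procedure controls the FDR and typically rejects at least as many hypotheses. The p-values and α below are illustrative only.

```python
def bonferroni_reject(pvals, alpha=0.025):
    """Reject p_i if p_i <= alpha / m (controls the FWER)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def bh_reject(pvals, alpha=0.025):
    """Benjamini-Hochberg step-up procedure (controls the FDR):
    find the largest rank k with p_(k) <= k * alpha / m and reject
    all hypotheses with p-values up to that cutoff."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    cutoff = pvals[order[k - 1]] if k else -1.0
    return [p <= cutoff for p in pvals]

# five treatment-control comparisons from a hypothetical platform trial
p = [0.001, 0.004, 0.012, 0.030, 0.200]
n_bonf = sum(bonferroni_reject(p))   # per-test threshold 0.025/5 = 0.005
n_bh = sum(bh_reject(p))
```

With these p-values, Bonferroni rejects two hypotheses and Benjamini-Hochberg rejects three, illustrating the power gained by relaxing FWER control to FDR control.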
In summary, the discussion revolves around how different clinical trial designs handle the problem of multiplicity and whether they incorporate methods to control the overall Type I error rate. These considerations are crucial when designing trials and analyzing their results, as they affect the reliability and validity of the conclusions drawn from the trial data.
Traditional Approach: Traditionally, in separate trials, multiplicity adjustments were not typically made because each trial was treated independently.
Contemporary Views on Platform Trials: In the current era, particularly in platform trials where multiple sub-studies are conducted under one overarching protocol, the approach to multiplicity is more nuanced. The question is whether to adjust for multiplicity when sub-studies share control arms or when multiple hypotheses are tested within a single trial framework.
Inferential Independence: A key term in this discussion is whether hypotheses are inferentially independent. If the truth of one hypothesis is unrelated to the truth of another, they are considered inferentially independent. This distinction affects whether multiplicity adjustments are deemed necessary.
Statistical Practice: Over the last few years, there’s been a leaning towards not adjusting for multiplicity in the context of platform trials, treating each sub-study as independent, which aligns with the practice of individual trial analysis.
Guideline References: You mentioned referring back to EMA guidelines which emphasize the control of the familywise error rate to avoid false-positive conclusions. This point stresses that while the industry may lean towards a more pragmatic approach, regulatory guidelines still uphold the principle of controlling the Type I error rate across a family of tests.
Risk Quantification: You introduced the idea of a risk framework to quantify the risk of false positives by controlling the overall Type I error rate. This approach would provide a structured way to assess the cumulative risk of false positives across a range of trials.
Best Use of Resources: The argument is about where the focus should be — should more resources be devoted to reducing the chance of a false positive (Type I error) or to increasing the power of the trial?
Efficiency Gains: You suggest that platform trials may offer higher efficiency by sharing control groups, allowing for cost savings or the possibility of reallocating those savings to improve other aspects of the trial, such as maintaining power.
Separate Trials: When conducting separate trials, each with its own control group, the power to detect an effect typically remains constant regardless of the number of treatments being tested.
Platform Trials: In a platform trial, the power can vary. With optimal allocation and shared controls, the power can be increased even if multiplicity adjustments are made. This leads to the idea that the efficiency gains from a platform trial could be used to offset the power loss from multiplicity adjustments.
Adjusting for Multiplicity: Even after adjusting for multiplicity in a platform trial, you can often retain higher power compared to separate trials. The question raised is why not leverage the benefits of a platform trial to allow for a less stringent correction for multiplicity?
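A back-of-the-envelope comparison under assumed numbers (δ = 0.3, σ = 1, one-sided α = 0.025, k = 4 experimental arms, roughly 800 patients in total, shared control sized √k times an experimental arm, a common allocation rule): even after a Bonferroni adjustment, the platform design can retain more power than separate trials of the same total size. All inputs are illustrative, not taken from any specific trial.

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def power_one_sided(delta, sigma, n_t, n_c, alpha):
    """Power of a one-sided z-test for a treatment-control contrast."""
    se = sigma * sqrt(1 / n_t + 1 / n_c)
    return nd.cdf(delta / se - nd.inv_cdf(1 - alpha))

delta, sigma, alpha, k = 0.3, 1.0, 0.025, 4

# separate trials: k trials of 100 + 100 patients each (total 800),
# no multiplicity adjustment since each trial stands alone
p_sep = power_one_sided(delta, sigma, 100, 100, alpha)

# platform trial with roughly the same total (4*133 + 266 = 798):
# shared control twice the size of each arm, Bonferroni alpha/k
n_arm = 133
p_plat = power_one_sided(delta, sigma, n_arm, 2 * n_arm, alpha / k)
```

Under these assumptions the Bonferroni-adjusted platform comparison is still more powerful than the unadjusted separate trials, because the shared, larger control arm shrinks the standard error of every contrast.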
Alternative Error Control: Instead of traditional Type I error control, you could adopt a slightly higher error rate or utilize other forms of error control, such as false discovery rate control.
Dynamic Nature: Platform trials introduce complexity since new arms can be added over time, which complicates the use of standard multiple testing procedures that typically require a predefined number of hypotheses.
Online Methods: Methods that allow for the use of data as it is collected, without pre-specifying the number of trials, are proposed. However, these methods might not allocate the same significance level across all hypotheses, which can be problematic.
False Discovery Rate: The FDR depends on the number of hypotheses tested and the number falsely rejected. Conventional methods to control the FDR assume a fixed number of hypotheses and simultaneous availability of all p-values.
Online FDR Control: New methods, termed “online control,” are required for platform trials. These may necessitate a predefined order of testing and decisions based on data available at each step.
Interim Analysis: With interim analyses, additional complexities arise. For example, if a treatment is stopped early for efficacy in an interim analysis, subsequent p-values and significance levels may need to be adjusted.
You’re addressing the complexities of incorporating new data into ongoing trials and the statistical methodologies that can benefit from such data. The central theme is the innovative use of shared control data to enhance the efficiency and robustness of platform trials. You also stress the importance of maintaining strict control over the Type I error rate, while acknowledging that platform trials offer unique opportunities to adapt and optimize based on the data as the trial progresses.
Integrating New Treatment Arms: In platform trials where new treatments may be added over time, you discuss how to make use of the data already collected from existing arms.
Frequency-Based Decisions: You mention the potential use of local control data models, which will be discussed later. This approach might involve leveraging data from control arms that are shared across different substudies within the platform trial.
Challenges with New Data: Using non-concurrent control data presents difficulties in strictly controlling the Type I error rate, which must then rely on certain assumptions.
Modifying Testing Strategies: The data already collected can be used to modify the testing strategy or even to recalibrate the sample size calculations for efficiency.
Quality of Data: You emphasize that the data collected within the platform trial is of high quality due to consistent endpoints, estimands, and trial procedures across substudies, making it more valuable than historical control data.
Average Power: The average power of the trial looks good, not quite at the 80% mark but around 77%. However, the variability in power depends on the number of treatments included dynamically.
Sample Size Variability: Including new treatments dynamically, such as after every set number of controls, results in a broad range for the resulting sample size, which can pose challenges for budgeting and planning.
Average Power vs. Individual Treatment Power: While average power is an important metric, it’s not the main characteristic of interest since it only applies to certain situations. There’s a notable drop in power after the first few treatments, suggesting that the decision to use data from the trial itself for updates needs to be made carefully.
Adaptive Design Considerations: With adaptive design, you’re not just conducting a final analysis for each treatment arm but also incorporating interim analyses.
Adjusting Alpha Levels: When hypotheses are rejected during interim analyses, the alpha level for subsequent tests may be increased, providing a mechanism to maintain the overall Type I error rate.
Sample Size Reassessment: Since interim analyses might involve unblinding data, there’s an opportunity to reassess and possibly update the sample size based on conditional power arguments.
Using Control Data: For conditional power calculations, there’s a consideration whether to use only concurrent control data or all control data collected so far. The latter can provide a more precise variance estimate under certain assumptions.
Power Enhancement: By utilizing conditional power calculations, you can potentially increase the power above the target threshold of 80%, even if you’re controlling for the false discovery rate.
Power Curve Dependence: The power curve is dependent on the number of treatments tested in the trial. With more treatments, the power to detect an effect dissipates, which is illustrated by different power curves depending on the number of treatments.
Power Discrepancy: There’s a contrast in power between having multiple treatments versus just one. With multiple treatments, power is higher compared to when only one treatment is tested.
by Bofill Roig, Marta
Medical University of Vienna, Austria
Platform trials evaluate the efficacy of multiple treatments, allowing for late entry of the experimental arms and enabling efficiency gains by sharing controls. The control data is divided into concurrent (CC) and non-concurrent controls (NCC) for arms that join the trial later. Using NCC for treatment-control comparisons can improve the power but might cause biased estimates if there are time trends. Several approaches have been proposed to utilise NCC while aiming to maintain the integrity of the trial. Frequentist model-based approaches adjust for potential bias by adding time as a covariate to the regression model. The Time Machine considers a Bayesian generalised linear model that uses a smoothed estimate for the control response over time. The Meta-Analytic-Predictive prior approach estimates the control response by combining the CC data with a prior distribution derived from the NCC data.
In this talk, we review the analysis approaches proposed for incorporating NCC in the treatment-control comparisons of platform trials. We investigate the operating characteristics of the considered approaches by means of a simulation study, focusing on assessing the impact of the overlap between treatment arms and the strength of the time trend on the performance of the evaluated models. We furthermore present the R package NCC for the design and analysis of platform trials. We illustrate the use of the above-mentioned approaches and show how to perform simulations in various settings through the NCC package.
Platform Trials: These are multi-stage trials where new treatment arms enter and exit at different times. Unlike traditional trials where one treatment is compared to one control, platform trials allow for multiple treatments to be evaluated simultaneously against a shared control group.
Types of Control Data: You differentiate between concurrent controls (patients randomized while the treatment is being evaluated) and non-concurrent controls (patients randomized before the treatment entered the trial).
Bias Due to Time Trends: In platform trials, time trends within the trial may introduce bias. The critical question is whether non-concurrent controls can be used in analysis and, if so, how to do it without introducing bias.
Separate Approach: This method uses only concurrent data from the period when the treatment under evaluation is active in the platform.
Pooled Approach: This approach naively pools all non-concurrent control data together without adjustments.
Model-based Approaches: Recent methods propose using regression models that adjust for time to incorporate non-concurrent controls, aiming to mitigate bias from time trends.
Frequentist Regression Model: To illustrate this method, you use an example where the trial duration is divided into two periods: before and after the second treatment arm enters the trial. The goal is to compare treatment arm 2 (R2) against the control using both concurrent and non-concurrent controls.
Bayesian Model “Time Machine”: This model is not described here in detail but is mentioned as another method to be discussed.
You conclude by stating that the study-wise Type I Error (T1E) rate control is not directly applicable to platform trials, especially perpetual ones.
Control of Familywise Error Rate (FWER): Controlling the FWER at the treatment or substudy level is a practical approach.
Independence in Hypotheses: You question the consensus on what is considered “independent” in the context of platform trials.
Online False Discovery Rate (FDR): Online FDR is suggested as suitable for exploratory platform trials.
Leveraging Past Data: The structure of platform trials allows for the use of past observations in planning new arms, though the overlap of treatments influences this approach.
Challenges with Early Treatments: Planning for very early treatments in platform trials is more complex, making sample size reassessment a potentially good strategy.
The trial is divided into two periods, with Arm 1 present in both, while Arm 2 enters in Period 2. The control arm is also present throughout. The regression method aims to account for the effect of time on treatment effects without assuming any interaction between time and treatment—meaning the time effect is uniform across all groups.
The hypothesis testing problem is centered around the treatment effect of Arm 2 versus the control arm, with the null hypothesis \(H_0: \theta_2 = 0\) stating that there is no effect, and the alternative hypothesis \(H_1: \theta_2 > 0\) suggesting a positive effect.
The model is a regression-based approach in which the expected outcome \(E(Y)\) is modeled as a function of treatment-group indicators and a period (time) effect.
The model assumes that the period effect is additive on the model scale and constant within each period. It’s important to note that there is no time-treatment interaction assumed, meaning the time effect is applied equally to all groups in the platform.
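Written out, one consistent reading of this working model (the notation is my reconstruction: arm and period indicators, no interaction term) is:

```latex
E(y_j) = \eta_0
       + \theta_1 \, \mathbf{1}\{\text{arm}_j = 1\}
       + \theta_2 \, \mathbf{1}\{\text{arm}_j = 2\}
       + \tau \, \mathbf{1}\{\text{period}_j = 2\},
```

where \(\eta_0\) is the control response in Period 1, \(\theta_1, \theta_2\) are the treatment effects, and \(\tau\) is a Period 2 effect shared by all arms, which encodes the no time-treatment interaction assumption.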
The diagram represents a platform trial over two periods, with Arm 1 and the control present in both periods and Arm 2 only in Period 2; sample means \(\bar{y}_{k,s}\) are denoted for each arm \(k\) and period \(s\).
The treatment effect estimator for Arm 2 (\(\hat{\theta}_2\)) is calculated using the difference between the sample mean of Arm 2 in Period 2 (\(\bar{y}_{2,2}\)) and a model-based estimate of the control response in Period 2 (\(\bar{y}^*_{0,2}\)).
The model-based estimate of the control response in Period 2 (\(\bar{y}^*_{0,2}\)) is a weighted average that combines the concurrent control data from Period 2 with the non-concurrent control information from Period 1.
The weight \(q\) is calculated using the harmonic mean of the sample sizes of the control arm in both periods and Arm 1 in both periods, where \(n_{0,1}, n_{0,2}, n_{1,1},\) and \(n_{1,2}\) represent the sample sizes for the control arm and Arm 1 in Periods 1 and 2, respectively.
This model smooths the control response over time to account for temporal drifts that could otherwise bias the treatment effect estimate.
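Under homoscedastic OLS, one reconstruction of the estimator sketched above is (this closed form is my derivation from the description, not quoted from the talk):

```latex
\hat{\theta}_2 = \bar{y}_{2,2} - \bar{y}^{*}_{0,2},
\qquad
\bar{y}^{*}_{0,2} = (1 - q)\,\bar{y}_{0,2}
                  + q\,\bigl(\bar{y}_{0,1} + \bar{y}_{1,2} - \bar{y}_{1,1}\bigr),
\qquad
q = \frac{1/n_{0,2}}{1/n_{0,1} + 1/n_{0,2} + 1/n_{1,1} + 1/n_{1,2}}.
```

Here \(\bar{y}_{0,1} + \bar{y}_{1,2} - \bar{y}_{1,1}\) projects the Period 1 control mean into Period 2 using Arm 1's observed shift, and \(q\) is an inverse-variance weight whose denominator is a sum of reciprocal sample sizes, i.e. the harmonic-mean structure mentioned above.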
Bofill Roig, M., Krotka, P., Burman, C.-F., Glimm, E., Gold, S. M., Hees, K., Jacko, P., Koenig, F., Magirr, D., Mesenbrink, P., Viele, K., & Posch, M. (2022). On model-based time trend adjustments in platform trials with non-concurrent controls. BMC Medical Research Methodology.
Krotka, P., Hees, K., Jacko, P., Magirr, D., Posch, M., & Bofill Roig, M. (2023). NCC: An R-package for analysis and simulation of platform trials with non-concurrent controls. SoftwareX.
Berry, S. M. (2022). The Bayesian Time Machine: Accounting for temporal drift in multi-arm platform trials. Clinical Trials.
Marschner, I. C., & Schou, I. M. (2022). Analysis of adaptive platform trials using a network approach. Clinical Trials.
Lee, K. M., & Wason, J. (2020). Including non-concurrent control patients in the analysis of platform trials: is it worth it? BMC Medical Research Methodology.
Bofill Roig, M., et al. (2023). On the use of non-concurrent controls in platform trials: A scoping review. Trials.
Bennett, M., & Mander, A. P. (2020). Designs for adding a treatment arm to an ongoing clinical trial. Trials.
by Grouven, Ulrich; Skipka, Guido, IQWiG, Germany
The performance of subgroup analyses is an important component in the preparation of HTA reports by the Institute for Quality and Efficiency in Health Care (IQWiG). The legal requirements explicitly oblige the Institute to conduct subgroup analyses with regard to age, sex, and disease severity [1].
The historical development of the methodological procedure for conducting subgroup analyses is outlined and the current methodological procedure described in IQWiG’s methods paper is presented [2]. The starting point is to consider possible heterogeneity between subgroups and to perform an interaction test based on Cochran’s Q statistic [3]. An alternative is to perform an F-test in the context of a meta-regression [4].
The concrete stepwise procedure for performing subgroup analyses and for deriving benefit statements is described in detail and illustrated with concrete examples from IQWiG’s benefit assessment. Possible problems and limitations are discussed.
Testing for Interaction (Heterogeneity): The first step involves using statistical tests (e.g., the Cochran Q test) to identify significant differences in treatment effects across subgroups. A standard significance level of 5% is used. If the test is not significant, it suggests that the treatment effect is consistent across subgroups, and thus, no further subgroup analysis is necessary. If the test is significant, it indicates potential differences in treatment effects across subgroups that warrant further analysis.
Minimum Requirements for Subgroup Analysis: If heterogeneity is significant, the analysis proceeds with additional criteria to ensure reliability. These criteria include having at least 10 people in each subgroup, at least 10 events for binary data and survival times, and a significant and relevant effect in at least one subgroup. These pragmatic rules aim to ensure sufficient reliability and relevance of the results.
Combining Conclusions Across Outcomes: After determining the significance of subgroup effects and ensuring that minimum requirements are met, separate conclusions for subgroups are derived based on outcomes. These conclusions are then combined into an overall conclusion across all relevant endpoints.
Historical Perspective and Empirical Investigation: You’ve also highlighted a shift from a previous, more complex algorithm for subgroup analysis that used two threshold values (5% and 20%) to the current, simplified approach. This change was motivated by the desire to reduce complexity and resource requirements while maintaining the ability to detect relevant interactions. An empirical investigation into dossier assessments in 2015, involving 100 separate analyses, revealed that the simplified approach is effective, as only a small fraction of cases indicated potential subgroup effects.
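As a concrete sketch of the interaction test in the first step, the following computes Cochran's Q for two subgroups (the effect estimates and standard errors are made-up illustrative numbers, not from an IQWiG assessment):

```python
import math

def cochran_q(effects, ses):
    """Q = sum of w_i * (theta_i - theta_pooled)^2 with inverse-variance weights."""
    w = [1.0 / se**2 for se in ses]
    pooled = sum(wi * ei for wi, ei in zip(w, effects)) / sum(w)
    return sum(wi * (ei - pooled) ** 2 for wi, ei in zip(w, effects))

# Hypothetical log hazard ratios and standard errors in two subgroups (men, women)
q = cochran_q([-0.50, -0.10], [0.15, 0.18])
# For k = 2 subgroups, Q ~ chi-square with 1 df under homogeneity;
# the upper tail probability is erfc(sqrt(Q/2)).
p = math.erfc(math.sqrt(q / 2.0))
print(round(q, 3), round(p, 4))
```

With two subgroups, Q is compared to a chi-square distribution with one degree of freedom; in this example the p-value exceeds 5%, so no further subgroup analysis would be triggered.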
The remarks point to the complexity that arises with more than two subgroups: pairwise statistical tests are then needed, and subgroups whose pairwise comparisons are not statistically significant may be summarized into one group where this is medically sensible. When there is effect modification across more than one subgroup characteristic (e.g., age and sex), interpreting the interactions becomes complex; separate analyses for the combined subgroups might be necessary, but such detailed analyses are rarely available, so decisions must be made on a case-by-case basis.
by Röver, Christian1; Kramer, Malte2; Friede, Tim1
1University Medical Center Göttingen; 2German Rheumatism Research Center, Berlin
Meta-analyses commonly include reports of additional subgroup-analyses, which may be useful as a sensitivity analysis, or to investigate whether a subset alone may provide sufficient (or consistent) evidence. Technically, the default procedure often is to perform separate analyses, overall and for data subsets.
Given that meta-analysis methods often perform poorly when only few studies are involved, matters get worse when study numbers are reduced to subsets. As performance issues commonly relate to the estimation of between-study heterogeneity (while the main focus is on overall effects), a promising approach may be to reduce model complexity by considering a single, common heterogeneity parameter. On the technical side, this means approaching subgroup analyses as a meta-regression problem (Röver and Friede, 2023). This methodological approach has been advocated previously (Dias et al., 2013), but is still rarely implemented in practice (Donegan et al., 2015).
We investigate the alternative approaches from both frequentist and Bayesian viewpoints and demonstrate the potential performance gains, as well as sensitivity to potential assumption violations using simulations as well as a larger number of published meta-analyses from the Cochrane Database of Systematic Reviews.
You then gave the broader rationale for routinely examining subgroups, highlighting the importance of such analyses as sensitivity checks and for assessing whether a subset alone provides sufficient or consistent evidence.
The decision on which analytical method to use may depend on the specific research question at hand. Whether a joint analysis of all studies, separate analyses for each subgroup, or a meta-regression approach is most suitable can depend on the assumptions made about the commonality of effects and heterogeneity.
The different models you mentioned (joint analysis, separate subgroup analyses, and meta-regression) reflect varying assumptions about the data and differ in the number of parameters they estimate: a joint analysis fits a single effect and a single heterogeneity parameter, separate analyses fit an effect and a heterogeneity parameter per subgroup, and a meta-regression fits subgroup-specific effects with a single common heterogeneity parameter.
These considerations determine the underlying data model and whether the approach is Bayesian or frequentist, which are methodologies for statistical inference. The key is that these considerations focus on the likelihood of the data model itself, not on the inferential framework used to draw conclusions from the data.
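A minimal sketch of the common-heterogeneity idea, assuming a DerSimonian-Laird-type moment estimator pooled over subgroups (the data are hypothetical, and this is not the authors' implementation):

```python
def common_tau2(groups):
    """groups: list of (effects, variances) per subgroup; returns tau^2 >= 0,
    estimated by summing within-subgroup Q statistics (moment estimator)."""
    q_sum, df_sum, c_sum = 0.0, 0, 0.0
    for effects, variances in groups:
        w = [1.0 / v for v in variances]
        mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
        q_sum += sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, effects))
        df_sum += len(effects) - 1
        c_sum += sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q_sum - df_sum) / c_sum)

def subgroup_mean(effects, variances, tau2):
    """Random-effects pooled mean within one subgroup, given the shared tau^2."""
    w = [1.0 / (v + tau2) for v in variances]
    return sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

# Hypothetical log odds ratios and within-study variances for two subgroups
g1 = ([-0.8, -0.1, -0.5], [0.04, 0.06, 0.05])
g2 = ([-0.1, 0.2], [0.05, 0.08])
tau2 = common_tau2([g1, g2])
m1 = subgroup_mean(*g1, tau2)
m2 = subgroup_mean(*g2, tau2)
print(round(tau2, 4), round(m1, 3), round(m2, 3))
```

Because all subgroups inform the single tau^2, small subsets borrow strength for the heterogeneity estimate while keeping subgroup-specific effects, which is exactly the parameter reduction described above.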
You also indicated that the meta-regression approach is not new and has been previously suggested and advocated in the literature.
Regarding the results illustration, you’ve mentioned the comparison of Bayesian and frequentist methods, denoting them in different colors, and how the meta-regression leads to shorter confidence intervals and potentially more accurate heterogeneity estimates compared to separate analyses. The frequentist approach, in particular, sometimes yields peculiar estimates for heterogeneity.
To assess how these methods compare in real-world settings, you’ve mentioned an empirical analysis using a database of systematic reviews. The goal was to examine the actual application of overall analyses versus subgroup analyses, particularly focusing on binary endpoints and log odds ratios. The median sizes of the overall meta-analyses versus the subgroups within these analyses were contrasted, highlighting that subgroups are inherently smaller.
Lastly, you hoped to see fewer estimates of zero heterogeneity, which would imply better estimation of variability in effect sizes. For Bayesian analyses, this expectation was not consistently met, but the frequentist meta-regression approach seemed to show a reduction in zero heterogeneity estimates, which can be considered a success in terms of the methodology’s ability to capture true heterogeneity.
Malte Kramer. Between-trial heterogeneity in meta-regression. MSc thesis, Universität Göttingen, 2022. https://medstat.umg.eu/en/overview/publications/diploma-master-and-bachelor/
See Clinical Trial Diagnostic Study
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
## system code page: 65001
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] mindr_1.3.2 xtable_1.8-4 survival_3.5-7
## [4] missForest_1.5 CALIBERrfimpute_1.0-7 mice_3.16.0
## [7] mosaic_1.8.4.2 mosaicData_0.20.3 ggformula_0.10.4
## [10] Matrix_1.6-2 lattice_0.21-9 rpact_3.3.4
## [13] clinUtils_0.1.4 htmltools_0.5.5 Hmisc_5.1-0
## [16] inTextSummaryTable_3.3.0 gtsummary_1.7.2 kableExtra_1.3.4.9000
## [19] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
## [22] dplyr_1.1.2 purrr_1.0.2 readr_2.1.4
## [25] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.4
## [28] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.15.0 jsonlite_1.8.5 shape_1.4.6
## [4] magrittr_2.0.3 jomo_2.7-6 ggstance_0.3.6
## [7] nloptr_2.0.3 farver_2.1.1 rmarkdown_2.22
## [10] ragg_1.2.5 vctrs_0.6.3 minqa_1.2.5
## [13] askpass_1.1 base64enc_0.1-3 webshot_0.5.4
## [16] itertools_0.1-3 curl_5.0.1 haven_2.5.2
## [19] broom_1.0.5 Formula_1.2-5 mitml_0.4-5
## [22] sass_0.4.6 bslib_0.5.0 htmlwidgets_1.6.2
## [25] plyr_1.8.8 cachem_1.0.8 gt_0.9.0
## [28] uuid_1.1-0 mime_0.12 lifecycle_1.0.3
## [31] iterators_1.0.14 pkgconfig_2.0.3 R6_2.5.1
## [34] fastmap_1.1.1 shiny_1.7.4 digest_0.6.31
## [37] colorspace_2.1-0 textshaping_0.3.6 crosstalk_1.2.0
## [40] randomForest_4.7-1.1 fansi_1.0.4 timechange_0.2.0
## [43] httr_1.4.6 polyclip_1.10-4 compiler_4.3.2
## [46] rngtools_1.5.2 fontquiver_0.2.1 withr_2.5.0
## [49] htmlTable_2.4.1 backports_1.4.1 highr_0.10
## [52] ggforce_0.4.1 pan_1.6 MASS_7.3-60
## [55] openssl_2.0.6 gfonts_0.2.0 tools_4.3.2
## [58] foreign_0.8-85 zip_2.3.0 httpuv_1.6.11
## [61] nnet_7.3-19 glue_1.6.2 nlme_3.1-163
## [64] promises_1.2.0.1 grid_4.3.2 checkmate_2.2.0
## [67] cluster_2.1.4 reshape2_1.4.4 generics_0.1.3
## [70] gtable_0.3.3 labelled_2.12.0 tzdb_0.4.0
## [73] data.table_1.14.8 hms_1.1.3 xml2_1.3.4
## [76] utf8_1.2.3 ggrepel_0.9.3 foreach_1.5.2
## [79] pillar_1.9.0 later_1.3.1 splines_4.3.2
## [82] tweenr_2.0.2 tidyselect_1.2.0 fontLiberation_0.1.0
## [85] knitr_1.43 fontBitstreamVera_0.1.1 gridExtra_2.3
## [88] svglite_2.1.1 crul_1.4.0 xfun_0.39
## [91] mosaicCore_0.9.2.1 DT_0.28 stringi_1.7.12
## [94] boot_1.3-28.1 yaml_2.3.7 evaluate_0.21
## [97] codetools_0.2-19 httpcode_0.3.0 officer_0.6.2
## [100] gdtools_0.3.3 cli_3.6.1 rpart_4.1.21
## [103] systemfonts_1.0.4 munsell_0.5.0 jquerylib_0.1.4
## [106] Rcpp_1.0.10 ellipsis_0.3.2 doRNG_1.8.6
## [109] lme4_1.1-33 glmnet_4.1-7 mvtnorm_1.2-2
## [112] viridisLite_0.4.2 broom.helpers_1.13.0 scales_1.2.1
## [115] ggridges_0.5.4 crayon_1.5.2 flextable_0.9.2
## [118] rlang_1.1.1 cowplot_1.1.1 rvest_1.0.3